Problem

Design a Health Check System

Monitor whether each service in a fleet is up and responding correctly. Alert on-call engineers when a service goes down.

Submitted March 24, 2026

·@Yi

Understand requirements · Design architecture

5/10

Iteration Score

Evaluated on stage progression, iteration quality, and depth of thinking

6/10

Goal Alignment Score

The user partially aligned with their goals but focused too much on implementation details early on.

·The user made some commitments about the health check system but repeated similar prompts without significant forward movement.

·There was a lack of exploration of tradeoffs and failure modes.

·The user jumped into implementation details too early without fully defining the architecture.

Prompt Chain · 5 prompts

User Prompt

Clarifying

I would like to design a health check system for software engineers that lets them monitor each microservice. It should show the response time and all logs. If a service fails a health check or an error message shows up, it should alert engineers and show the logs.

AI Response

511 in·125 out·636 total

It sounds like you have a good foundational idea for the health check system. To clarify further, what specific aspects of the health check do you want to monitor? For example, will you focus on metrics like uptime, response time, error rates, and specific logs? Also, how do you envision engineers interacting with this system—will it be via a dashboard, notifications, or both? Once we have a clearer picture of the key metrics and user interaction, we can start to define the scope and requirements for the system. Next: Clarify the specific metrics and user interactions you want to include in the health check system.

User Prompt

Constraints

There would be a health check every 10 seconds, and a green light would be shown on a dashboard if the health check passes. It would also track requests coming in to the endpoint and show green if a request is processed successfully. If the request is processed successfully but takes significantly longer, more than 30000ms, it would show yellow. The datetime and the log of that specific request would also be shown in another component of the dashboard. As for errors, if the error is a 300 or 400 error, it would show yellow, and the datetime and logs of that request would also be shown in another component of the dashboard. If the error is a 500, the dashboard would show a red circle, and the datetime and logs would be shown in their own unique component. A graph showing the response time history would also be on the dashboard for easier readability.
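The color rules in this prompt can be written down as a small classifier. The following is a minimal sketch, not part of the submission: it assumes "300 or 400 error" means any 3xx or 4xx status code, "500" means any 5xx code, and everything below 300 counts as success. The names `Status`, `classify`, and `SLOW_THRESHOLD_MS` are hypothetical.

```python
from enum import Enum

# "more than 30000ms" is the slow-request threshold stated in the prompt.
SLOW_THRESHOLD_MS = 30_000


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify(status_code: int, response_time_ms: float) -> Status:
    """Map one request outcome to a dashboard color."""
    # Assumption: "a 500" covers the whole 5xx range.
    if 500 <= status_code <= 599:
        return Status.RED
    # Assumption: "300 or 400 error" covers the 3xx and 4xx ranges.
    if 300 <= status_code <= 499:
        return Status.YELLOW
    # Successful but slower than the threshold is also yellow.
    if response_time_ms > SLOW_THRESHOLD_MS:
        return Status.YELLOW
    return Status.GREEN
```

Checking server errors first keeps a slow 5xx response red rather than yellow, which seems closest to the intent of the prompt.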

AI Response

830 in·187 out·1017 total

You've provided a comprehensive overview of how the health check system will function. Now, let's focus on constraining your design. Consider the following aspects:

1. **Scale**: How many microservices will this health check system monitor concurrently? This will affect the architecture and performance requirements.
2. **Latency**: You mentioned checks every 10 seconds. Are there any requirements for how quickly alerts must be sent after a health check failure is detected?
3. **Assumptions**: Are there any assumptions about the infrastructure, such as whether the services are hosted on a cloud platform or on-premises?
4. **Requirements**: What technologies or frameworks do you plan to use for the dashboard? Will you need any specific integrations for logging or alerting?

Defining these constraints will help in making informed decisions about the system architecture and components. Next: Define the scale, latency, assumptions, and requirements for your health check system.

User Prompt

Constraints

The health check should monitor between 5 and 10 microservices. To prevent latency, checks can be done less frequently, about every quarter of an hour, for services that were deployed more than 1 week ago, and more frequently, every 1 minute, for services that were deployed no more than 3 days ago. However, health check logs should always be printed every 10 seconds. The dashboard should have a main page with a general graph showing the response times and health checks of all the services, plus compiled info and error logs for the microservices. There would also be a tab for each service that can be clicked into for more detailed information and for adding customizable features.
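The age-based schedule in this prompt can be sketched as a single function. This is an illustration under assumptions, not the submitted design: it reads the first rule as applying to services that are more than a week old, and because the prompt does not say what happens between 3 and 7 days, it keeps the 1-minute interval there as a conservative guess. The name `check_interval` and the two constants are hypothetical.

```python
from datetime import datetime, timedelta

FREQUENT = timedelta(minutes=1)     # services deployed no more than 3 days ago
INFREQUENT = timedelta(minutes=15)  # services deployed more than 1 week ago


def check_interval(deployed_at: datetime, now: datetime) -> timedelta:
    """Return how often a service should be checked, based on deployment age."""
    age = now - deployed_at
    if age <= timedelta(days=3):
        return FREQUENT
    if age > timedelta(days=7):
        return INFREQUENT
    # Assumption: the prompt leaves 3 to 7 days unspecified, so this sketch
    # stays on the frequent schedule until the service is a week old.
    return FREQUENT
```

A scheduler would call `check_interval` once per service each time it plans the next probe, so a service moves to the slower schedule on its own as it ages.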

AI Response

1170 in·196 out·1366 total

Great! You've established a clear scale and defined the frequency of checks based on deployment age. Now, it's time to commit to some assumptions and choose a direction for the design. Consider the following:

1. **Assumptions**: Assume that the health check system will need to handle network latency and potential downtime of the services being monitored. Will you also assume a certain level of reliability for the logging infrastructure?
2. **Technology Stack**: Decide on the technology stack you'll use for the dashboard and backend services. For instance, will you use a specific language, framework, or cloud service for hosting?
3. **Alerting Mechanism**: Commit to how alerts will be sent out. Will this be through email, SMS, or integration with a messaging platform like Slack?

Choosing these key elements will help you move forward with the architecture and implementation details. Next: Commit to assumptions about reliability and technology stack, and decide on the alerting mechanism for your health check system.

User Prompt

Architecture

The backend would be Python, while the frontend would use TypeScript and React. You may recommend messaging platforms to me, as I do not have extensive knowledge of those services. The health check will need to handle network latency and large log outputs during peak hours, when a large number of requests come in.

AI Response

1434 in·209 out·1643 total

Good choices on the technology stack! Python is great for backend development, and React with TypeScript will provide a robust frontend experience. Regarding messaging platforms for alerts, here are a few popular options:

1. **Slack**: Widely used in development teams, it allows you to send alerts directly to channels or users.
2. **Microsoft Teams**: Similar to Slack, it integrates well with other Microsoft services and is popular in corporate environments.
3. **Twilio**: If you want SMS alerts, Twilio is an excellent option for sending text messages.
4. **Email Services**: Consider using services like SendGrid or Amazon SES for sending email alerts.

Now that you have a clearer picture of the technology and alerting mechanisms, let’s start building the structure of your health check system. Think about how you want to design the backend components that will handle health checks, logging, and alerting. Next: Outline the main components of the backend system, including how health checks, logging, and alerting will be structured.

User Prompt

Architecture

The backend would be Python, while the frontend would use TypeScript and React. Slack would be used to send alerts to users and channels, but only when there is a 500 error. The health check will need to handle network latency and large log outputs during peak hours, when a large number of requests come in.
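The Slack-only-on-500 rule could look roughly like the sketch below. It assumes alerts go through a Slack incoming webhook, which accepts an HTTP POST with a JSON body containing a `text` field, and it treats any 5xx code as "a 500 error". The function names and the webhook URL passed in are hypothetical; the request is only built here, not sent.

```python
import json
import urllib.request


def should_alert(status_code: int) -> bool:
    """Alert only on server errors (assumption: any 5xx counts as 'a 500')."""
    return 500 <= status_code <= 599


def build_slack_request(
    webhook_url: str, service: str, status_code: int, log_excerpt: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {
        "text": f"{service} returned {status_code}. Recent log: {log_excerpt}"
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the alert, the caller would pass the returned request to `urllib.request.urlopen` with a timeout, so that a slow Slack endpoint cannot stall the health checker itself.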

AI Response

1716 in·248 out·1964 total

Great! Now that you've defined the backend technology and alerting mechanism, let's focus on building the structure of your backend components. Here are some key components to consider:

1. **Health Check Service**: This component will perform the health checks at the defined intervals (every minute or every 15 minutes) based on the deployment age. It will collect metrics such as response time and error codes.
2. **Logging Service**: This component will handle logging of all health check data, including successful checks, errors, and request logs. Ensure that it can efficiently manage large volumes of logs during peak times.
3. **Alerting Service**: This component will monitor the logs for errors, particularly 500 errors, and trigger alerts via Slack. It should also include a mechanism to prevent alert fatigue, such as rate limiting alerts for repeated failures.
4. **Dashboard API**: This will serve data to your React frontend, providing the necessary information for the dashboard, such as health check status, logs, and response time history.

With these components defined, think about how they will interact with each other and how data will flow through the system. Next: Detail the interactions between the components and how data will flow within the health check system.
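The rate limiting suggested for the Alerting Service can be illustrated with a per-service cooldown. This is one possible sketch, not a design the conversation settled on: the class name `AlertRateLimiter`, the 5-minute default cooldown, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Callable, Dict


class AlertRateLimiter:
    """Allow at most one alert per service within a cooldown window."""

    def __init__(
        self,
        cooldown_seconds: float = 300.0,  # assumed default, not from the source
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    def allow(self, service: str) -> bool:
        """Return True if an alert for this service may be sent now."""
        now = self._clock()
        last = self._last_sent.get(service)
        if last is not None and now - last < self._cooldown:
            return False  # still inside the cooldown, suppress the repeat
        self._last_sent[service] = now
        return True
```

Keying the cooldown by service name means one noisy service cannot suppress the first alert for a different service. The clock is passed in so the behavior can be tested without waiting in real time.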
