Issues with system slowness and "bad gateway" errors while accessing DrChrono

Incident Report for DrChrono

Postmortem

Description

On 05/16, customers received scattered errors on various screens throughout the application.

Timeline

All times are PDT

Date/Time	Activity
2022-05-14 06:40	One of our virtual servers encounters a hardware failure on AWS and is shut down automatically.
2022-05-14 06:53	The affected instance is rebooted automatically and starts serving traffic incorrectly. Some customers start to be affected.
2022-05-15 19:44	Our engineering team is notified that users are having problems when opening any appointment.
2022-05-16 04:45	The support team raises the urgency of the issue due to the increase in customer traffic and support tickets received.
2022-05-16 05:42	A member of the DevOps team starts to investigate the issue.
2022-05-16 05:52	An incident is posted to the status page.
2022-05-16 06:55	The affected instance is removed from the pool and request error rates drop sharply.
2022-05-16 07:07	Status page is updated to monitoring.
2022-05-16 09:58	Status page is updated to resolved.

Contributing Factor(s)

AWS hardware failures are infrequent enough that our infrastructure is not as hardened against them as it should be. After the restart, the server should have started every service needed to serve production traffic, but only one of the two services started properly. That was enough to pass our health checks, so traffic was routed to the instance, but not enough to serve that traffic correctly.
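To illustrate the gap described above, here is a minimal sketch of a health check that reports healthy only when every required service responds. It is not our actual health check. The function name `aggregate_health` and the service names `service_a` and `service_b` are hypothetical, chosen only for the example.

```python
from typing import Callable, Dict, Tuple


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every per-service check and return (HTTP status, per-service results).

    The status is 200 only if at least one check is configured and all of
    them pass. A check that raises is counted as a failure.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A crashing probe means the service cannot be confirmed as up.
            results[name] = False
    status = 200 if results and all(results.values()) else 503
    return status, results


if __name__ == "__main__":
    # Hypothetical probes: one service came up after the reboot, one did not.
    status, detail = aggregate_health({
        "service_a": lambda: True,
        "service_b": lambda: False,
    })
    print(status, detail)
```

With a check of this shape, an instance where only one of two services started would report 503, and a load balancer that honors the status code would stop sending it traffic.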

Stabilization Steps

The DevOps team took the affected instance out of the webserver pool.

Impact

Customers were receiving scattered errors on the platform.

Corrective Actions

The following DevOps tasks were created:

Ensure webservers serve traffic correctly after a forced reboot: webservers should come up healthy even after a hard reboot.
Enhance health checks to verify that applications are actually up: this is a maintenance task so that noticeably bad instances do not serve production traffic.
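
One way the first task could be approached is a readiness gate that runs at boot and holds the instance out of the pool until every required service answers. The sketch below is an assumption for illustration, not the implemented fix. The name `wait_until_ready` and its parameters are hypothetical, and the step that registers the instance with the pool is left to the caller.

```python
import time
from typing import Callable, Iterable


def _passes(check: Callable[[], bool]) -> bool:
    """Run one probe, treating any exception as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def wait_until_ready(
    checks: Iterable[Callable[[], bool]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll all checks until they pass together or the timeout expires.

    Returns True once every check passes in the same round, False if the
    deadline is reached first. The checks always run at least once.
    """
    probes = list(checks)
    deadline = clock() + timeout_s
    while True:
        if all(_passes(p) for p in probes):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A boot script could call `wait_until_ready` with one probe per service and join the webserver pool only when it returns True, leaving the instance out of rotation otherwise.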

Posted May 20, 2022 - 05:59 PDT

Resolved

This incident has been resolved. We apologize for the inconvenience caused today. A post-mortem will be available via this status page within the next week.

If you continue to experience issues related to the errors described here, please reach out to our support team with the details so we can troubleshoot with you.

Posted May 16, 2022 - 09:58 PDT

Monitoring

Our team has identified and implemented a fix for this issue. We are currently monitoring the results.

Posted May 16, 2022 - 07:07 PDT

Investigating

We are receiving reports from customers who are experiencing “bad gateway” or “create cash model” errors as well as general slowness in accessing the platform. Our team is working to identify the cause as its highest priority and will send another update later today with our progress.

Posted May 16, 2022 - 05:52 PDT

This incident affected: drchrono.com, drchrono iPad EHR, and drchrono iPad Check-In Kiosk Application.