Intermittent Downtime
Incident Report for DrChrono
Postmortem

Primary Database Outages

Summary

On 3/2/2021 and 3/4/2021, DrChrono experienced issues related to our primary database server’s data-saving mechanisms that disabled a significant portion of our EHR platform’s ability to handle customers' requests.

The issue has been identified and corrected. We do not expect to experience recurring issues of this nature in the future.

Timeline (EST, 24-hour clock)

2021-03-02 08:58: The DrChrono account management team reported that the DrChrono site was down.

2021-03-02 08:59: The site recovered. Upon investigation, the DevOps team determined that the brief outage was caused by a failed health check from the DNS provider.

2021-03-02 13:14: DrChrono employees started experiencing issues while loading pages.

2021-03-02 13:17: The DevOps team identified a long-running query as a potential cause. Once the query finished running and was no longer consuming resources, the site started recovering.

2021-03-02 13:21: The site started to have reduced availability again because of long-running queries. Several queries related to specific aspects of the platform were identified as possible causes.

2021-03-02 13:35: Once the new queries had finished, the site started recovering and remained stable for the rest of the day.

2021-03-03: The incident during this day is covered in a separate RCA: Root Cause Analysis 3/3/2021 & 3/8/2021.

2021-03-04 13:33: DevOps, engineering and executive team members began a Zoom meeting to discuss and troubleshoot the ongoing issues throughout the day.

2021-03-04 13:33: DevOps and engineering members began researching and verifying various internal settings in our primary database servers.

2021-03-04 13:45: While investigating the current issues on the primary database server, the server started recovering and the site was once again at normal availability.

2021-03-04 16:23: Another site outage began, and an investigation into the cause of this latest database issue was started.

2021-03-04 18:34: The site recovered and came back online. Investigation into the causes was ongoing over the next several days.

2021-03-08: The incident during this day is covered in a separate RCA: Root Cause Analysis 3/3/2021 & 3/8/2021.

2021-03-09: As the DevOps and engineering teams continued their investigations, it was determined that an internal setting on the primary database server was set incorrectly for the larger AWS servers we’re now utilizing. The value was adjusted. The teams have been monitoring since.

Contributing Factors

After a variable threshold is reached, the database server is designed to slow the writes it accepts as an overload prevention mechanism. If the primary server continues receiving a large number of write requests, it can trigger a separate, second threshold that stops all writes for a brief window to allow the server to catch up.
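The two-threshold behavior described above can be sketched in a few lines of Python. This is only an illustrative model: the report does not name the database engine or the settings involved, so `WriteMode`, `write_mode`, and the threshold parameters are hypothetical names, and "pending writes" stands in for whatever backlog measure the real server uses.

```python
from enum import Enum


class WriteMode(Enum):
    NORMAL = "normal"        # writes accepted at full speed
    THROTTLED = "throttled"  # first threshold reached: writes are slowed
    STALLED = "stalled"      # second threshold reached: writes are stopped


def write_mode(pending_writes: int, slow_threshold: int, stop_threshold: int) -> WriteMode:
    """Classify how a server would treat new writes for a given backlog.

    All three arguments are hypothetical stand-ins for the server's
    internal backlog measure and its two configured thresholds.
    """
    if slow_threshold >= stop_threshold:
        raise ValueError("slow_threshold must be below stop_threshold")
    if pending_writes >= stop_threshold:
        return WriteMode.STALLED
    if pending_writes >= slow_threshold:
        return WriteMode.THROTTLED
    return WriteMode.NORMAL
```

For example, with a slow threshold of 100 and a stop threshold of 200, a backlog of 150 pending writes would be throttled, while a backlog of 250 would stop writes entirely until the server catches up.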

The internal settings of the database server, combined with the query load from our normal traffic and from jobs scheduled at the beginning of the month, contributed to the primary database hitting these thresholds.

The traffic and query loads exposed the issues with the internal settings, which were causing a sharp increase in disk IO times. This sharp increase in disk IO times caused the primary server to quickly reach the second threshold and stop accepting write requests until it was able to recover. This behavior is the reason for the up-and-down nature of the issue throughout both days.
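A toy simulation can show why slow disk IO produces this repeated fail-and-recover pattern. It is a sketch under stated assumptions, not a model of the actual server: `simulate_backlog`, its parameters, and the choice to halve accepted writes while slowed are all hypothetical, and `flush_rate` stands in for how many writes the disk can persist per time step.

```python
def simulate_backlog(ticks: int, incoming: int, flush_rate: int,
                     slow_at: int, stop_at: int) -> list[str]:
    """Return the write mode ("normal", "slowed", "stalled") at each tick.

    Hypothetical model: the backlog grows by the writes accepted and
    shrinks by flush_rate each tick. Writes are halved once the backlog
    reaches slow_at, and stopped once it reaches stop_at until the
    backlog drains below slow_at again.
    """
    backlog = 0
    stalled = False
    history = []
    for _ in range(ticks):
        if stalled and backlog < slow_at:
            stalled = False  # the server has caught up; accept writes again
        if backlog >= stop_at:
            stalled = True   # second threshold: stop all writes
        if stalled:
            mode, accepted = "stalled", 0
        elif backlog >= slow_at:
            mode, accepted = "slowed", incoming // 2  # first threshold
        else:
            mode, accepted = "normal", incoming
        backlog = max(0, backlog + accepted - flush_rate)
        history.append(mode)
    return history


# Slow disk: the backlog outpaces flushing, so the server cycles
# through normal, slowed, and stalled periods.
slow_disk = simulate_backlog(ticks=40, incoming=10, flush_rate=3, slow_at=20, stop_at=30)

# Healthy disk: flushing keeps up with the same write load.
healthy_disk = simulate_backlog(ticks=40, incoming=10, flush_rate=10, slow_at=20, stop_at=30)
```

With the same incoming write load, only the run with the low flush rate ever stalls, and it returns to normal operation between stalls. That mirrors the report's description of a site that went down and recovered several times without the traffic itself changing.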

Impact

Customers experienced intermittent but widespread outages between the time the issue first started and the time corrective action was taken.

Corrective Actions

The root cause, identified as an internal system setting on the database server, was isolated and modified. The change has been under both manual and automated monitoring since it was made. Additionally, we are continuing to work with our database vendor to further tune our database servers for the increased capacity resulting from our ongoing migration to AWS.

Posted Mar 17, 2021 - 13:39 PDT

Resolved
From approximately 10:00 am PST to 11:00 am PST many DrChrono customers experienced intermittent downtime with the system. We encountered a brief issue with one of our databases that caused disruptions and was the cause of the problems you experienced. We've resolved the issue and you should see improvements moving forward. Please let DrChrono support know if you experience any additional issues.
Posted Mar 02, 2021 - 12:00 PST