Site Outage
Incident Report for DrChrono
Postmortem

Database Stability / Concurrency Issues

Summary

From 1/6/2021 to 2/16/2021, DrChrono experienced three stability incidents with our primary database. Each time, the error caused a temporary “read-only” outage for the DrChrono site, which cascaded into a full site outage lasting between 12 and 22 minutes. This issue has been identified and corrected; we do not expect to see additional occurrences.

Timeline (EST, 24-hour clock)

2021-01-06 09:40: Network traffic began to drop across all servers.

2021-01-06 09:42: Error rates began to increase for the application.

2021-01-06 09:45: Reports began identifying an outage for some customers.

2021-01-06 09:45: Engineering began investigating.

2021-01-06 09:50: Widespread outage issues reported.

2021-01-06 10:02: Error rates began to decrease.

2021-01-06 10:03: Network traffic began to increase across all servers.

2021-01-06 10:04: Error rates stabilized to normal levels.

2021-01-06 10:30: Postmortem investigation began. Additional monitoring and logging were put in place.

2021-01-27 04:45: Widespread outage issues reported.

2021-01-27 05:16: DrChrono site was restored.

2021-01-27 08:32: Detailed investigation began, and increased logging and monitoring were put in place. Using the logging and monitoring added after the first incident, additional steps were identified to mitigate the error if it recurred.

2021-02-16 18:47: Widespread outage issues reported.

2021-02-16 18:55: Previously identified mitigation steps were used to reduce outage time.

2021-02-16 18:59: Site access restored. Status updated to “Monitoring”.

2021-02-16 19:30: Investigation deep dive began.

Contributing Factors

Each incident was caused by a process in the primary database server that hung and then crashed. Because the process hung prior to crashing, the automatic restart of the process, which restores the DrChrono application, was delayed.
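
To illustrate why a hang delays recovery more than a plain crash does, here is a minimal sketch in Python. It is not DrChrono's tooling or the database vendor's supervisor; the function names and the `probe` callable are hypothetical. The sketch assumes a supervisor that restarts a process only when it exits, and shows that such a supervisor takes no action while the process is still running but unresponsive unless a separate liveness probe is added.

```python
import subprocess
from typing import Callable, Tuple


def classify(proc: subprocess.Popen, probe: Callable[[float], bool],
             timeout_s: float) -> str:
    """Return 'crashed', 'hung', or 'healthy' for a supervised process.

    probe is a hypothetical callable: it returns True if the process
    answered a liveness check within timeout_s seconds.
    """
    if proc.poll() is not None:
        return "crashed"   # process has exited; exit-based restart can fire
    if not probe(timeout_s):
        return "hung"      # still running but unresponsive; no exit to react to
    return "healthy"


def supervise_once(proc: subprocess.Popen, probe: Callable[[float], bool],
                   start: Callable[[], subprocess.Popen],
                   timeout_s: float = 2.0) -> Tuple[str, subprocess.Popen]:
    """Check the process once and restart it if it crashed or hung."""
    state = classify(proc, probe, timeout_s)
    if state == "hung":
        proc.kill()        # turn the hang into an exit so restart is not delayed
        proc.wait()
    if state != "healthy":
        proc = start()     # start is a hypothetical factory for a fresh process
    return state, proc
```

In this sketch the time to recover from a hang is bounded by the probe interval plus `timeout_s`, rather than by however long the process takes to crash on its own.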

During the investigation, the error was correlated with a separate error in our replica fleet of database servers that was not customer-impacting. With the combined information, the root cause was identified as a software concurrency bug related to the version of the database software in use and the larger server/instance sizes used in the AWS environment.

Impact

The DrChrono site and associated applications were down for approximately 12 to 22 minutes during the incidents.

Corrective Actions

Prior to issue resolution, increased monitoring and logging were deployed after each incident to:

  1. Identify the cause of the crash.
  2. Identify steps to mitigate the impact of the error if it occurred again.

Our database software vendor released a patch on 2/22/2021 that resolved this issue. The patch was deployed to a test server on 2/24/2021 to verify the fix. After verification, the patch was deployed to our entire replica server fleet the week of 3/1/2021. After verification in the replica fleet, the patch was deployed to our primary database servers on 3/8/2021, resolving the issue.
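
The rollout order described above (test server, then replica fleet, then primary) can be sketched as a verification-gated loop. This is an illustrative sketch only, not the actual deployment tooling; `staged_rollout`, `apply_patch`, `verify`, and the stage labels are hypothetical names chosen for the example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    apply_patch: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Patch each stage in order; stop at the first failed verification.

    Returns the stages that were patched and verified.
    """
    completed: List[str] = []
    for stage in stages:
        apply_patch(stage)
        if not verify(stage):
            break  # leave later, more critical stages untouched
        completed.append(stage)
    return completed


# Labels mirror the order in the text; they are not real host names.
STAGES = ["test server", "replica fleet", "primary database servers"]
```

Ordering the stages from least to most customer-facing means a bad patch is caught before it reaches the primary database servers.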

Posted Mar 18, 2021 - 06:38 PDT

Resolved
This incident has been resolved.
Posted Jan 06, 2021 - 11:15 PST
Monitoring
Our team worked to resolve this morning's outage and we are now monitoring performance throughout the day. The outage was due to a network connectivity issue between our databases that caused a network interruption and took about 20 min for the system to recover. We appreciate your patience and understanding as we worked to get the site back up and running and deeply apologize for the disruption caused.
Posted Jan 06, 2021 - 09:30 PST
Update
We are continuing to investigate the cause of this morning's outage, but at this time the site is back up and running for most of our customer base.
Posted Jan 06, 2021 - 07:42 PST
Investigating
The DrChrono platform is currently experiencing a site outage. Our team is working hard to investigate this issue and we will provide an update as soon as we have more information.
Posted Jan 06, 2021 - 06:53 PST
This incident affected: drchrono.com, drchrono iPad EHR, drchrono iPad Check-In Kiosk Application, and DrChrono Telehealth Platform.