From 1/6/2021 to 2/16/2021, DrChrono experienced three stability incidents with our primary database. Each error caused a temporary “read-only” outage for the DrChrono site, which triggered a cascading failure that led to a site outage lasting between 12 and 22 minutes. This issue has been identified and corrected; we do not expect to see additional occurrences.
2021-01-06 09:40: Network traffic began to drop across all servers.
2021-01-06 09:42: Error rates began to increase for the application.
2021-01-06 09:45: Reports began identifying outage for some customers.
2021-01-06 09:45: Engineering began investigating.
2021-01-06 09:50: Widespread outage issues reported.
2021-01-06 10:02: Error rates began to decrease.
2021-01-06 10:03: Network traffic began to increase across all servers.
2021-01-06 10:04: Error rates stabilized to normal levels.
2021-01-06 10:30: Post mortem investigation began. Additional monitoring and logging were put in place.
2021-01-27 04:45: Widespread outage issues reported.
2021-01-27 05:16: DrChrono site was restored.
2021-01-27 08:32: Detailed investigation began, and increased logging and monitoring were put in place. Using data from the logging and monitoring added after the previous incident, additional steps were identified to mitigate the error if it recurred.
2021-02-16 18:47: Widespread outage issues reported.
2021-02-16 18:55: Previously identified mitigation steps were used to reduce outage time.
2021-02-16 18:59: Site access restored. Status updated to “Monitoring”.
2021-02-16 19:30: Investigation deep dive began.
The incident was caused by a process hang and subsequent crash in the primary database server. Because the process hung before crashing, the automatic restart of the process was delayed, which in turn delayed restoration of the DRC application.
During the investigation, the error was correlated with another error, not customer-impacting, in our replica fleet of database servers. With the combined information, the root cause was identified as a software concurrency bug related to the version of database software in use and the larger server/instance sizes in use in the AWS environment.
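To illustrate why a hang delays recovery more than a clean crash, here is a minimal sketch of a hang-aware health check. It is an assumption for illustration only and does not describe the restart tooling actually in use: the names `ProcessState`, `classify`, and `should_restart`, the heartbeat mechanism, and the 30-second timeout are all hypothetical. A supervisor that only checks whether the process exists treats a hung process as healthy, while one that also checks a recent heartbeat can restart it sooner.

```python
from enum import Enum
from typing import Optional


class ProcessState(Enum):
    HEALTHY = "healthy"
    HUNG = "hung"
    CRASHED = "crashed"


def classify(
    is_running: bool,
    last_heartbeat: Optional[float],
    now: float,
    hang_timeout: float = 30.0,  # hypothetical threshold, in seconds
) -> ProcessState:
    """Classify a process from its liveness and its most recent heartbeat.

    is_running     -- whether the process still exists at the OS level
    last_heartbeat -- timestamp of the last successful heartbeat, if any
    now            -- current timestamp, in the same units as last_heartbeat
    """
    if not is_running:
        # The process is gone: a plain existence check catches this case.
        return ProcessState.CRASHED
    if last_heartbeat is None or now - last_heartbeat > hang_timeout:
        # The process exists but has stopped responding: an existence
        # check alone would still report it as healthy.
        return ProcessState.HUNG
    return ProcessState.HEALTHY


def should_restart(state: ProcessState) -> bool:
    """Restart on both crash and hang, not only on crash."""
    return state is not ProcessState.HEALTHY
```

In this sketch, a process that is still running but last reported a heartbeat 40 seconds ago is classified as `HUNG` and becomes eligible for restart immediately, instead of waiting until it finally crashes.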
The DrChrono site and associated applications were down for approximately 12 to 22 minutes during the incidents.
Prior to issue resolution, increased monitoring and logging were deployed after each incident to:
Our database software vendor released a patch on 2/22/2021 that resolved this issue. The patch was deployed to a test server on 2/24/2021 to verify the fix. After verification, the patch was deployed to our entire replica server fleet the week of 3/1/2021. After verification in the replica fleet, the patch was deployed to our primary database servers on 3/8/2021, resolving the issue.