Site slowness
Incident Report for DrChrono
Postmortem

Database Replication Error Events

Summary

On 3/3/2021 and 3/8/2021, DrChrono experienced database replication error events that disabled a significant portion of our database replica servers. Some customers may have experienced lingering latency issues during the repair and remediation phases.

The issue has been identified and corrected; we do not expect issues of this nature to recur.

Timeline (EST, 24-hour clock)

2021-03-03 10:06 - Replication issues automatically detected; Ops Team paged.

2021-03-03 10:13 - Replication issues identified; first failed server removed from operational pool.

2021-03-03 10:40 - Remaining failed servers removed from the operational pool. A small portion of traffic remains directed at delayed servers to prevent overloading the remaining healthy servers.

2021-03-03 10:57 - Status Page updated to reflect a Major Incident.

2021-03-03 11:00 - First of the delayed servers begins remediation for the replication error.

2021-03-03 11:06 - First of the delayed servers is online and starting to take traffic.

2021-03-03 11:06 - Traffic is routed away from delayed servers; errors subside and DrChrono is fully functional.

2021-03-03 11:31 - DrChrono site and application are stable. All replicas are healthy and serving traffic.

2021-03-08 01:48 - Replication issues automatically detected; Ops Team paged.

2021-03-08 02:20 - All traffic moved to healthy servers - no impact for any customers.

2021-03-08 02:40 - Errors are identified as unrecoverable; restore processes begin.

2021-03-08 08:35 - Sufficient healthy capacity has been restored to serve peak traffic for DrChrono.

2021-03-08 08:40 - Restoration of additional servers begins to add further capacity.

2021-03-08 09:30 - DrChrono begins period of peak traffic.

2021-03-08 09:55 - DrChrono receives reports of intermittent issues for some customers.

2021-03-08 11:15 - Additional capacity restoration identified as the cause of intermittent issues.

2021-03-08 11:15 - Additional capacity restoration terminated and rescheduled.

2021-03-08 11:15 - DrChrono latency and errors subside; site is fully operational.

2021-03-08 11:30 - Postmortem investigation begins.

Contributing Factors

The Engineering Ops Team identified a database server configuration that contributed to an uncommon, unrecoverable replication corruption error. DrChrono’s infrastructure is designed to prioritize the integrity of our data. When this condition was triggered on a portion of our servers:

  • The condition was immediately detected.
  • The impacted replica servers correctly halted all replication, effectively becoming read-only.
  • The impacted replica servers correctly continued to serve data that was current up to the moment replication halted, until they were removed from the replication availability pool. This preserved capacity and protected the health of non-impacted servers.
  • The primary write servers were not impacted.

The impacted replicas were rebuilt with fresh copies of data from our primary write servers to ensure integrity, health, and functionality.
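
The pool behavior described above (halted replicas keep serving their last-known data only while healthy capacity is insufficient) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DrChrono's actual implementation: the `Replica` type, the `select_pool` function, and the capacity units are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    replicating: bool  # False once replication has halted (data is stale)
    capacity: int      # hypothetical units of read traffic the server can absorb


def select_pool(replicas, required_capacity):
    """Return the replicas that should remain in the read pool.

    Healthy replicas always stay. Halted (stale) replicas stay only while
    the healthy replicas alone cannot cover required_capacity.
    """
    healthy = [r for r in replicas if r.replicating]
    stale = [r for r in replicas if not r.replicating]

    pool = list(healthy)
    available = sum(r.capacity for r in healthy)

    # Keep just enough stale replicas to avoid overloading the healthy ones.
    for r in stale:
        if available >= required_capacity:
            break
        pool.append(r)
        available += r.capacity
    return pool


# Hypothetical example: one healthy replica and two halted ones.
replicas = [
    Replica("replica-1", True, 100),
    Replica("replica-2", False, 100),
    Replica("replica-3", False, 100),
]
print([r.name for r in select_pool(replicas, 150)])
```

In this sketch, a stale replica is dropped from the pool as soon as healthy capacity covers demand, which mirrors the trade-off in the timeline between serving delayed data and overloading the remaining healthy servers.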

Impact

Some customers experienced latency or an increase in errors during the restoration period when capacity in the data replication pool was reduced. No data was lost or corrupted on the primary servers. To ensure the healthy operation of our servers, full data restores were initiated on all pool members. This restore process exacerbated capacity issues on 3/8/2021, impacting our customers. The non-essential portion of the restore process was halted and rescheduled.
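
One way to express the distinction between essential and non-essential restores is a simple gate that refuses to start a non-essential restore if it would overlap the peak traffic window. The sketch below is illustrative only: the function name, the end of the peak window, and the estimated restore duration are assumptions, since the report states only that peak traffic began at 09:30.

```python
from datetime import datetime, timedelta


def may_start_restore(now, peak_start, peak_end, estimated_duration, essential):
    """Decide whether a replica restore may start at `now`."""
    if essential:
        # Restores needed to reach peak-serving capacity always proceed.
        return True
    finishes_at = now + estimated_duration
    # A non-essential restore may run only if it does not overlap the peak window.
    overlaps_peak = now < peak_end and finishes_at > peak_start
    return not overlaps_peak


# Hypothetical values: peak_end and the 3-hour estimate are assumptions.
allowed = may_start_restore(
    now=datetime(2021, 3, 8, 8, 40),
    peak_start=datetime(2021, 3, 8, 9, 30),
    peak_end=datetime(2021, 3, 8, 17, 0),
    estimated_duration=timedelta(hours=3),
    essential=False,
)
print(allowed)
```

Under these assumed values, the additional-capacity restore would be deferred until after the peak window rather than started at 08:40.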

Corrective Actions

The root cause, a configuration element for expanded redundancy in the infrastructure, was identified and corrected. DrChrono’s monitoring and alerting infrastructure worked as designed to protect the integrity of our customers' data and minimize capacity impact. We do not expect to see this condition impact our infrastructure going forward.

Posted Mar 15, 2021 - 06:17 PDT

Resolved
This incident has been resolved. Our engineers are reviewing what caused today’s incident, and we will share a postmortem/RCA on status.drchrono.com within the week.
Posted Mar 03, 2021 - 12:08 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 03, 2021 - 11:09 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 03, 2021 - 10:37 PST
Investigating
Our team has identified an additional issue that continues to affect system responsiveness, and we are currently investigating it.
Posted Mar 03, 2021 - 09:58 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 03, 2021 - 08:34 PST
Identified
Some customers may experience system slowness, issues saving appointments and clinical notes, and an inability to prescribe medications or send referrals. We’ve identified the issue and are working to resolve it.
Posted Mar 03, 2021 - 07:56 PST
This incident affected: drchrono.com, drchrono iPad EHR, drchrono iPad Check-In Kiosk Application, onpatient.com, and onpatient iPhone PHR.