Site Outage
Incident Report for DrChrono
Postmortem

Database Replication Error Events

Summary

On 3/3/2021 and 3/8/2021, DrChrono experienced database replication error events that disabled a significant portion of our database replica servers. Some customers may have experienced lingering latency issues during the repair and remediation phases.

The issue has been identified and corrected; we do not expect to experience recurring issues of this nature in the future.

Timeline (EST, 24-hour clock)

2021-03-03 10:06 - Issues with replication automatically identified and paged to Ops Team.

2021-03-03 10:13 - Replication issues identified; first failed server removed from operational pool.

2021-03-03 10:40 - Remaining failed servers removed from the operational pool. A small portion of traffic remains directed at delayed servers to prevent overloading the remaining healthy servers.

2021-03-03 10:57 - Status Page updated to reflect a Major Incident.

2021-03-03 11:00 - First of the delayed servers begins remediation for the replication error.

2021-03-03 11:06 - First of the delayed servers is online and starting to take traffic.

2021-03-03 11:06 - Traffic is routed away from delayed servers; errors subside and DrChrono is fully functional.

2021-03-03 11:31 - DrChrono site and application are stable. All replicas are healthy and serving traffic.

2021-03-08 01:48 - Issues with replication automatically identified and paged to Ops Team.

2021-03-08 02:20 - All traffic moved to healthy servers - no impact for any customers.

2021-03-08 02:40 - Errors are identified as unrecoverable; restore processes begin.

2021-03-08 08:35 - Sufficient healthy capacity has been restored to serve peak traffic for DrChrono.

2021-03-08 08:40 - Restoration begins on additional servers to add further capacity.

2021-03-08 09:30 - DrChrono begins period of peak traffic.

2021-03-08 09:55 - DrChrono receives reports of intermittent issues for some customers.

2021-03-08 11:15 - Additional capacity restoration identified as the cause of intermittent issues.

2021-03-08 11:15 - Additional capacity restoration terminated and rescheduled.

2021-03-08 11:15 - DrChrono latency and errors subside; site is fully operational.

2021-03-08 11:30 - Postmortem investigation begins.

Contributing Factors

The Engineering Ops Team identified a configuration on our database servers that contributed to an uncommon, unrecoverable replication corruption error. DrChrono’s infrastructure is designed to prioritize the integrity of our data. When this condition was triggered on a portion of our servers:

  • The condition was immediately detected.
  • The impacted replica servers correctly halted all replication, effectively becoming read-only.
  • The impacted replica servers correctly continued to serve data that was current as of the moment replication halted, until they were removed from the replication availability pool. This preserved capacity and protected the health of non-impacted servers.
  • The primary write servers were not impacted.

The impacted replicas were rebuilt with fresh copies of data from our primary write servers to ensure integrity, health, and functionality.
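
The pool behavior described above can be illustrated with a minimal sketch. It is not DrChrono's implementation; the `Replica` type, the `routing_weights` function, and the `min_healthy` and `stale_weight` parameters are hypothetical names chosen for illustration. It assumes a read pool where halted replicas are dropped from rotation when enough healthy replicas remain, and otherwise keep a small share of traffic so the healthy servers are not overloaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    replication_running: bool  # False once the replica has halted replication


def routing_weights(replicas, min_healthy, stale_weight=0.1):
    """Return {replica name: relative traffic weight} for the read pool.

    Healthy replicas get weight 1.0. Halted (stale) replicas get weight 0.0
    when at least min_healthy healthy replicas remain; otherwise they keep a
    small stale_weight so the healthy replicas are not overloaded.
    All thresholds here are illustrative assumptions.
    """
    healthy_count = sum(1 for r in replicas if r.replication_running)
    enough_capacity = healthy_count >= min_healthy

    weights = {}
    for r in replicas:
        if r.replication_running:
            weights[r.name] = 1.0
        else:
            weights[r.name] = 0.0 if enough_capacity else stale_weight
    return weights


# Hypothetical three-replica pool with one halted replica.
pool = [Replica("a", True), Replica("b", False), Replica("c", True)]
print(routing_weights(pool, min_healthy=2))
print(routing_weights(pool, min_healthy=3))
```

With `min_healthy=2` the halted replica receives no traffic; with `min_healthy=3` it keeps a reduced share, which mirrors the trade-off made at 10:40 on 3/3/2021.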

Impact

Some customers experienced latency or an increase in errors during the restoration period when capacity in the data replication pool was reduced. No data was lost or corrupted on the primary servers. To ensure the healthy operation of our servers, full data restores were initiated on all pool members. This restore process exacerbated capacity issues on 3/8/2021, impacting our customers. The non-essential portion of the restore process was halted and rescheduled.
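
The decision to halt and reschedule the non-essential restores can be expressed as a simple capacity check. The sketch below is illustrative only: the function name, its parameters, and the capacity units are assumptions, not a description of DrChrono's tooling.

```python
def should_defer_restore(healthy_capacity, expected_load, restore_overhead, essential):
    """Return True if a restore should be rescheduled to an off-peak window.

    healthy_capacity: load the healthy replicas can serve (arbitrary units)
    expected_load:    load expected while the restore would run
    restore_overhead: capacity the restore itself is assumed to consume
    essential:        True if the restore is needed to reach safe capacity
    """
    if essential:
        # Restores needed to reach a safe baseline are never deferred.
        return False
    # Defer when the restore would leave less capacity than the expected load.
    return healthy_capacity - restore_overhead < expected_load


# Hypothetical numbers: a non-essential restore during peak traffic is deferred.
print(should_defer_restore(100, 90, 20, essential=False))
```

A check like this would defer the additional-capacity restore that overlapped with peak traffic on 3/8/2021, while still allowing the restores required to reach baseline capacity.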

Corrective Actions

The root cause, a configuration element for expanded redundancy in the infrastructure, was identified and corrected. DrChrono’s monitoring and alerting infrastructure worked as designed to protect the integrity of our customers' data and minimize capacity impact. We do not expect to see this condition impact our infrastructure going forward.

Posted Mar 15, 2021 - 06:17 PDT

Resolved
From approximately 6:30 am PST to 1:00 pm PST we experienced a throughput issue with one of our databases that caused customers to experience site slowness. This incident has been resolved. Our engineers are reviewing what caused today’s incident, and we will share a postmortem/RCA on status.drchrono.com within about a week.
Posted Mar 08, 2021 - 12:06 PST
Update
Our team has identified a throughput issue with our databases and is actively working to resolve it. Customers may experience intermittent slowness throughout the day until the issue is resolved. We recognize the impact this has on your business and apologize for the ongoing interruptions. We'll continue to post updates here via the status page.
Posted Mar 08, 2021 - 07:59 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 08, 2021 - 07:16 PST
Investigating
Some customers using the DrChrono platform are currently experiencing intermittent errors. Our team is working hard to investigate this issue, and we will provide an update as soon as we have more information.
Posted Mar 08, 2021 - 07:03 PST
This incident affected: drchrono.com, drchrono iPad EHR, drchrono iPad Check-In Kiosk Application, DrChrono Telehealth Platform, onpatient.com, and onpatient iPhone PHR.