On 3/3/2021 and 3/8/2021, DrChrono experienced database replication errors that disabled a significant portion of our database replica servers. Some customers may have experienced lingering latency during the repair and remediation phases.
The issue has been identified and corrected; we do not expect to experience recurring issues of this nature in the future.
2021-03-03 10:06 - Issues with replication automatically identified and paged to Ops Team.
2021-03-03 10:13 - Replication issues identified; first failed server removed from operational pool.
2021-03-03 10:40 - Remaining failed servers removed from the operational pool. A small portion of traffic remains directed at delayed servers to prevent overloading the remaining healthy servers.
2021-03-03 10:57 - Status Page updated to reflect a Major Incident.
2021-03-03 11:00 - First of the delayed servers begins remediation for the replication error.
2021-03-03 11:06 - First of the delayed servers is online and starting to take traffic.
2021-03-03 11:06 - Traffic is routed away from delayed servers; errors subside and DrChrono is fully functional.
2021-03-03 11:31 - DrChrono site and application are stable. All replicas are healthy and serving traffic.
2021-03-08 01:48 - Issues with replication automatically identified and paged to Ops Team.
2021-03-08 02:20 - All traffic moved to healthy servers - no impact for any customers.
2021-03-08 02:40 - Errors are identified as unrecoverable; restore processes begin.
2021-03-08 08:35 - Sufficient healthy capacity has been restored to serve peak traffic for DrChrono.
2021-03-08 08:40 - Restoration of additional servers begins to provide additional capacity.
2021-03-08 09:30 - DrChrono begins period of peak traffic.
2021-03-08 09:55 - DrChrono receives reports of intermittent issues for some customers.
2021-03-08 11:15 - Additional capacity restoration identified as the cause of intermittent issues.
2021-03-08 11:15 - Additional capacity restoration terminated and rescheduled.
2021-03-08 11:15 - DrChrono latency and errors subside; site is fully operational.
2021-03-08 11:30 - Post Mortem investigation begins.
The Engineering Ops Team identified a database server configuration that contributed to an uncommon, nonrecoverable replication corruption error. DrChrono’s infrastructure is designed to prioritize the integrity of our data. When this condition was triggered on a portion of our servers, the impacted replicas were rebuilt with fresh copies of data from our primary write servers to ensure integrity, health, and functionality.
Some customers experienced latency or an increase in errors during the restoration period when capacity in the data replication pool was reduced. No data was lost or corrupted on the primary servers. To ensure the healthy operation of our servers, full data restores were initiated on all pool members. This restore process exacerbated capacity issues on 3/8/2021, impacting our customers. The non-essential portion of the restore process was halted and rescheduled.
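The decision to halt and reschedule the non-essential restores can be illustrated with a small sketch. This is not DrChrono's actual tooling. The function names, the traffic figures, the 25% headroom, and the end of the peak window are all assumptions made for illustration; only the 09:30 start of peak traffic comes from the timeline above. The idea is that a restore runs during peak hours only when the healthy replicas cannot cover peak load on their own.

```python
import math
from datetime import time


def replicas_needed(peak_qps: float, per_replica_qps: float, headroom: float = 0.25) -> int:
    """Minimum number of healthy replicas to serve peak load plus spare headroom.

    All inputs are hypothetical capacity figures, not measured values.
    """
    return math.ceil(peak_qps * (1.0 + headroom) / per_replica_qps)


def should_run_restore(
    healthy: int,
    peak_qps: float,
    per_replica_qps: float,
    now: time,
    peak_start: time = time(9, 30),  # peak start taken from the timeline
    peak_end: time = time(17, 0),    # assumed end of the peak window
) -> bool:
    """Decide whether a replica restore should run right now.

    A restore is essential when the healthy pool cannot cover peak load.
    A non-essential restore is deferred until outside the peak window.
    """
    essential = healthy < replicas_needed(peak_qps, per_replica_qps)
    in_peak = peak_start <= now < peak_end
    return essential or not in_peak
```

Under these assumed figures, a pool with enough healthy replicas would defer extra restores at 10:00 and allow them at 02:00, while a pool below the required count would restore immediately regardless of the time.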
The root cause, a configuration element for expanded redundancy in the infrastructure, was identified and corrected. DrChrono’s monitoring and alerting infrastructure worked as designed to protect the integrity of our customers' data and minimize capacity impact. We do not expect to see this condition impact our infrastructure going forward.