DrChrono system slowness
Incident Report for DrChrono
Postmortem

At 2:15 PM ET on December 18th, a system hosted by our cloud provider became degraded and intermittently unavailable, reducing our capacity to serve requests. We paused application traffic and placed a maintenance page to allow the system to recover. At 4:45 PM ET, we lifted the maintenance page, the system returned to health, and traffic levels and background processing returned to normal. 

The failing system is the same system that caused other degradations and outages in the last 90 days. In every case, the outage was caused not by a recent DrChrono release but by a hardware failure occurring on an ad-hoc basis. Since the October 18th outage, the team has been working as quickly as possible to patch the system to prevent recurrence and, in parallel, to replace it entirely. In late November, based on assurances from our cloud provider, we patched the system in place with the expectation that the patch would remove the memory leak causing the degradation. Unfortunately, as the December 18th event showed, the memory leak remains in our running version.  

We recognize it is unacceptable for this system to disrupt you and your patients. Based on assurances from our cloud provider, we believed the patch would provide relief and give the replacement workstream time to complete its project without an additional event occurring. The patch did not provide that relief, and the team is working to further expedite the replacement. The code for the new system backing the background task processor was released to production on December 14th in beta mode so that we can roll it out and test it gradually. The full rollout of the replacement is currently slated to complete in the first half of January, and the team is working quickly, including over the holidays, to pull that date in further. We will communicate once the rollout is complete. Please know that this is our team's number one priority. Fundamentally, we are replacing a vital piece of DrChrono's foundation, and significant testing is required to ensure you and your patients are supported appropriately.  

In the meantime, while the DrChrono application still runs on the system we are replacing, the infrastructure team has provisioned a spare cluster in case the existing cluster fails again. A contributing factor to the length of these downtimes is that provisioning a new cluster takes roughly one hour. Keeping a spare on hand lets us move to a new cluster and restore service faster than waiting on server restarts or fresh cluster provisioning. The team is adding this approach to its playbooks for all systems that support the DrChrono application. 

Corrective actions related to this event include: 

  • Further expediting the system replacement workstream – In progress 
  • Updating the maintenance page wording to make clear when maintenance is unplanned or emergency maintenance – In progress 
  • Developing a playbook for maintaining the spare cluster and a procedure for activating it – Complete as of December 19th 
  • Researching and implementing additional upgrades to the current system where they are expected to further reduce the occurrence rate of memory leak issues – In progress 
  • Investigating and publishing a plan, with timelines, for an “offline viewer” that DrChrono customers can use during a degradation or outage to access schedule and clinical information in read-only mode – In progress; a plan will be published in the coming weeks 

 

We architected our solution around a cloud provider-managed service for our background task processors. This solution has served us well for many years. While cloud provider solutions are generally highly resilient and highly available, that is no longer the case for this service. Ultimately, your and your patients' experience is our responsibility, and we take it very seriously. We are working aggressively to deliver a solution that provides consistent service. In addition, as the last corrective action above notes, we consider it an action item to provide our customers with an “offline viewer” to lessen disruption and enable continued, safe care. We will share more on those plans in the coming weeks. We appreciate your patience as we roll out these changes and find more stable ground.

Posted Dec 20, 2023 - 15:26 PST

Resolved
This incident has been resolved.
Posted Dec 18, 2023 - 14:53 PST
Monitoring
A fix has been implemented and we are monitoring the results. A formal RCA will be made available via our status page in the coming days.
Posted Dec 18, 2023 - 13:55 PST
Update
We have made progress resolving the root issue and will have an ETA or a full resolution soon.
Posted Dec 18, 2023 - 13:25 PST
Identified
The issue has been identified. We are working to restore service as soon as possible and will continue to post updates here.
Posted Dec 18, 2023 - 12:43 PST
Update
We are continuing to investigate this issue and are currently down for unplanned maintenance. We will continue to provide updates via our status page.
Posted Dec 18, 2023 - 11:49 PST
Investigating
We are currently investigating reports of intermittent slowness in the platform. We apologize for this inconvenience. Updates will be provided here as we have additional information to share.
Posted Dec 18, 2023 - 11:14 PST
This incident affected: drchrono.com and drchrono iPad EHR.