DrChrono System Performance
Incident Report for DrChrono
Postmortem

RCA for DrChrono application outage on 18-October-2023

Issue Start Date/Time:              18-Oct-2023 at 10:50 AM ET

Issue Resolution Date/Time:   18-Oct-2023 at 6:30 PM ET

Issue Summary:

Customers experienced 503 errors and slow performance while logging into the DrChrono web application beginning at 10:50 AM ET on Oct 18th, 2023. Service was fully restored by 6:30 PM ET on Oct 18th, 2023.

How were customers impacted?

Customers could not log into or use the DrChrono web application, mobile application, or public APIs.

Root Cause:

At 10:50 AM ET, two systems hosted by our cloud provider became degraded and intermittently unavailable. Because of the interplay between the two systems, customers experienced an often severely degraded service during the window noted above. The outage was not caused by a change made by the DrChrono team; it was a point-in-time event attributable to an apparent hardware failure.

The first system that failed backs our audit log storage, and its failure blocked any action that would have written to the audit log. The failure was traced to disk pressure, most likely a bad disk in the cloud provider’s environment: write latency was very high even though write volume remained normal. The pressure created by this degradation caused the second system, our background task processor, to fail to complete tasks and accumulate a backlog large enough to deadlock it. Notably, the number of tasks was within expected bounds and should not have caused this degradation, but it did nonetheless; our cloud provider has confirmed this to be a bug in the version of the task processor they provide to us.

To guarantee that writes are processed for every incoming request and to maintain a high level of data integrity, each request verifies that the background task processor is available. With the background task processor in an intermittently degraded state, any request that could not reach it returned a 503 error.
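For illustration only, the request-path dependency check described above behaves roughly like the following sketch; the broker address, timeout, and function names are assumptions made for this example, not DrChrono’s actual implementation.

    # Illustrative sketch only: broker address, names, and timeout are assumptions.
    import socket

    TASK_BROKER = ("task-broker.internal", 5672)  # assumed broker host/port

    def task_processor_available(timeout=1.0):
        """Best-effort reachability check against the background task processor."""
        try:
            with socket.create_connection(TASK_BROKER, timeout=timeout):
                return True
        except OSError:
            return False

    def handle_request(request, dispatch):
        # Each request verifies the task processor before doing work; if it is
        # unreachable, the request fails fast with a 503 rather than risk a lost write.
        if not task_processor_available():
            return {"status": 503, "body": "Service Unavailable"}
        return dispatch(request)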

Ultimately, both systems needed repair before service could return to healthy. Both were configured for high availability with automatic failover, and their attempts to recover automatically, combined with cloud provider dashboards reporting them as healthy, masked the true issues for some time. Additionally, the background task processor is configured to add more instances when it degrades or its queues grow. In this instance, that behavior made the background task processor’s performance counterintuitively worse because of the bug mentioned above.
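The scale-out behavior described above can be thought of as a rule along the lines of the sketch below; the thresholds and names are illustrative assumptions, not our actual configuration. Under the provider’s deadlock bug, each additional worker increased contention rather than draining the backlog.

    # Illustrative sketch of queue-depth-based scale-out; thresholds and names
    # are assumptions, not DrChrono's actual configuration.
    def desired_worker_count(queue_depth, current_workers,
                             tasks_per_worker=100, max_workers=50):
        """Scale out as the backlog grows (this rule never scales in on its own)."""
        needed = max(1, -(-queue_depth // tasks_per_worker))  # ceiling division
        return min(max(needed, current_workers), max_workers)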

Because the root causes were masked, the DrChrono and cloud provider support teams had to conduct a significant investigation, and it took several hours to pinpoint the true root causes and the resolutions needed. Once identified, those resolutions were applied.

Resolution(s):

At roughly 4:00 PM ET, the audit log datastore was re-provisioned with more CPU, RAM, and storage; the goal was not additional capacity but to force the instance to shift from failed to healthy hardware within the cloud provider. Within 15 minutes, the new instance was up, running, and healthy, and it was keeping up with the same volume of writes that had been failing just previously.

Unfortunately, the background task processor was still in a failed state due to the bug mentioned above and the pressure created while the audit log datastore was intermittently out of service. Because page loads check the health of the background task processor, DrChrono still had issues serving customer requests.

At roughly 6:15 PM ET, we determined that reducing the number of servers performing tasks for the background task processor would put less pressure on the bug causing the deadlocks. Within 10 minutes of reducing the number of servers, the background task processor recovered and service returned to normal.

We did receive reports of slowness for roughly 10 minutes afterward, but the data we collected indicates this was the system continuing to recover. Based on our own testing, monitoring, and data, we consider DrChrono to have returned to healthy service at 6:30 PM ET.

Throughout the investigation and resolution, we identified a number of opportunities to implement process changes and enhanced monitoring to prevent this from occurring in the future.

Mitigation steps planned/taken:

  1. Review and implement configuration that sets the number of background task processor servers at a level that avoids the bug we experienced. This was completed as of 10/18.
  2. Replace the existing background task processor with a system that handles our task volumes more performantly and does not have defects that affect our workloads. This is in progress and is our highest-priority open item. We expect it to be completed within the next 30 days, with temporary relief already in place via item #1 above.
  3. Introduce a buffer for audit log writes so that operations can continue, with no data loss, even if audit log storage is unavailable (a minimal sketch of this pattern appears after this list). This is in progress, and we expect it to be completed within the next 60 days.
  4. Improve alerting for the issues experienced above and for other issues that may occur in these secondary systems within the DrChrono infrastructure stack. This is in progress, and we expect it to be completed within the next 30 days.
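As a rough illustration of mitigation item #3, a write-behind buffer for audit log entries could look like the sketch below; the in-process queue, drainer thread, persist callback, and retry policy are assumptions for the example, not the final design.

    # Illustrative sketch of buffered audit log writes; the in-process queue,
    # persist() callback, and retry policy are assumptions, not the final design.
    import queue
    import threading
    import time

    _audit_buffer = queue.Queue()

    def record_audit_event(event):
        """Request path: enqueue locally and return immediately, so an
        unavailable audit log store does not block customer operations."""
        _audit_buffer.put(event)

    def _drain_buffer(persist, retry_delay=5.0):
        """Background drainer: retry each write until the audit log datastore
        accepts it, so no audit data is lost while the store is degraded."""
        while True:
            event = _audit_buffer.get()
            while True:
                try:
                    persist(event)  # write to the audit log datastore
                    break
                except Exception:
                    time.sleep(retry_delay)  # store unavailable; retry later

    def start_drainer(persist):
        threading.Thread(target=_drain_buffer, args=(persist,), daemon=True).start()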
Posted Oct 23, 2023 - 14:31 PDT

Resolved
DrChrono has resolved all issues. We apologize for any inconvenience this may have caused.
Posted Oct 19, 2023 - 09:50 PDT
Update
DrChrono is continuing to monitor system performance today.
Posted Oct 19, 2023 - 05:08 PDT
Monitoring
We believe system performance has improved. We are no longer seeing 503 errors; however, you may still be experiencing system slowness. We are continuing to investigate to bring about a full resolution.
Posted Oct 18, 2023 - 15:45 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 60 minutes.
Posted Oct 18, 2023 - 15:05 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 60 minutes.
Posted Oct 18, 2023 - 14:14 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 60 minutes.
Posted Oct 18, 2023 - 13:09 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 60 minutes.
Posted Oct 18, 2023 - 12:15 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 60 minutes.
Posted Oct 18, 2023 - 11:09 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 60 minutes.
Posted Oct 18, 2023 - 10:09 PDT
Update
We are continuing to investigate the issues with slowness and intermittent outages. We are sorry for the issues this is causing. We will update this status page again in 45 minutes.
Posted Oct 18, 2023 - 08:59 PDT
Investigating
DrChrono is currently experiencing slowness or intermittent outages. Our engineering team is looking into this issue. We are sorry for any inconvenience this has caused. We will continue to update this page.
Posted Oct 18, 2023 - 08:07 PDT
This incident affected: drchrono.com, drchrono iPad EHR, and onpatient.com.