General system slowness on the DrChrono platform

Incident Report for DrChrono

Postmortem

On 11/3, users encountered difficulties accessing the website and other system functionality.

Timeline

All times are in PST.

Date/Time	Activity
2022-11-02 22:00:00	An Amazon Web Services system failure causes the primary database to be restarted. The DrChrono EHR platform becomes inaccessible.
2022-11-02 22:15:00	Amazon Web Services automatically restarts the affected server, but a metadata loss causes other servers to stop being able to serve production database traffic as well.
2022-11-02 22:15:00	An engineer on the DevOps team comes online after being notified of the issue by automated alerting systems and starts investigating.
2022-11-02 22:25:00	The root cause of the issue is identified and the engineer reroutes traffic to preserve system availability. The DrChrono EHR platform is now accessible, but running at reduced capacity due to the loss of extra servers.
2022-11-02 23:30:00	After the first attempts to recover the affected servers directly fail, the DevOps engineer initiates a backup and restore process for one of the affected servers to prepare capacity for the upcoming day's traffic.
2022-11-03 01:30:00	The rescue process does not succeed for the additional servers and it's determined that they will also need to be restored from backups. The process is initiated.
2022-11-03 05:44:00	Status page was created to notify customers about “General system slowness on the DrChrono platform”.
2022-11-03 08:30:00	One of the replica servers finalizes phase 1/3 of backup restore; the next steps are engaged immediately.
2022-11-03 11:45:00	One of the replica servers completes phase 3/3. Additional replica server restoration processes are initiated.
2022-11-03 14:15:00	The restoration process for the additional replica servers fails.
2022-11-03 16:27:00	Restoration of the additional replica servers is attempted again but also fails.
2022-11-03 17:45:00	A backup process from the primary DB started.
2022-11-03 20:55:00	One additional replica server finishes the restoration process.
2022-11-03 21:15:00	Spun up 3 more database replicas as a backup solution.
2022-11-03 22:30:00	Added an additional 3 database replicas.
2022-11-04 00:05:00	6 new database replicas are installed and configured.
2022-11-04 01:40:00	The backup from the primary database is finalized.
2022-11-04 01:45:00	The primary backup snapshot with fast restore option is started.
2022-11-04 02:10:00	The standby server started phase 1/3 of backup restore from the primary database.
2022-11-04 04:00:00	Restore process is completed.
2022-11-04 07:00:00	Status page was updated to resolved.
2022-11-04 07:07:00	One of the replica servers is removed. After reviewing the server response, the team found it was not at full capacity and will need a complete restoration.
2022-11-04 08:10:00	Task and message count displays were blocked, increasing the response capacity of the third server and the primary.
2022-11-04 08:42:00	A status page was created to notify customers that “Task and message counts are temporarily unavailable”.
2022-11-04 08:42:00	A replica server finalizes phase 2/3 of backup restore; the final phase is engaged.
2022-11-05 17:00:00	Replica restoration process and shutdown of other servers are completed. All DB servers were monitored over the following hours.
2022-11-06 09:50:00	Announced that the system had returned to full capacity and was operational.
2022-11-07 06:33:00	Status page “Task and message counts are temporarily unavailable” was resolved.
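
To make the key intervals in the timeline easier to read, the following is a minimal sketch that computes the elapsed time from the start of the outage to a few milestones. The timestamps are copied from the table above; the dictionary keys and the elapsed() helper are illustrative names, not part of any DrChrono tooling. The timestamps are treated as naive wall-clock values, so any daylight-saving shift during the incident is ignored.

```python
from datetime import datetime

# Timestamps copied from the timeline table (naive, as written there).
FMT = "%Y-%m-%d %H:%M:%S"

# Key names are illustrative labels chosen for this sketch.
events = {
    "outage_start": "2022-11-02 22:00:00",
    "traffic_rerouted": "2022-11-02 22:25:00",
    "status_page_created": "2022-11-03 05:44:00",
    "status_page_resolved": "2022-11-04 07:00:00",
    "full_capacity_announced": "2022-11-06 09:50:00",
}


def elapsed(start_key, end_key):
    """Return the wall-clock difference between two timeline entries."""
    start = datetime.strptime(events[start_key], FMT)
    end = datetime.strptime(events[end_key], FMT)
    return end - start


# Print the time from the start of the outage to each later milestone.
for key in events:
    if key != "outage_start":
        print(f"{key}: {elapsed('outage_start', key)}")
```

Under these assumptions, the platform was reachable again 25 minutes after the initial failure, the first status page went up 7 hours 44 minutes in, and full capacity was announced 3 days, 11 hours and 50 minutes after the outage began.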

Contributing Factor(s)

Sentry alerts contain information to identify the current issue.

Stabilization Steps

A configuration change was made in the database to match the previous behavior.

Impact

Users experienced overall system slowness affecting major functionality, including errors when copying notes, locking notes, and saving information.

Corrective Actions

The Ops team has fixed the issue in the database table and will gradually roll out updates and connections to the database. Outlining the exact process for database configuration changes, along with mitigation and fallback procedures, will prevent a similar situation from happening again.

Posted Nov 14, 2022 - 11:37 PST

Resolved

Our team has confirmed that system issues related to site slowness have been resolved. We will be continuing maintenance and server optimization over the weekend. An RCA will be published on this site once available.

There may be instances where saved information is not immediately visible, or where you encounter issues saving data. This is a temporary effect of our systems refreshing and should resolve on its own within the day. If the issue persists, please reach out to our support team so we can troubleshoot further with you.

Posted Nov 04, 2022 - 07:01 PDT

Monitoring

Our team has applied measures to correct the slowness issues customers have encountered today. Our server speeds are recovering and are expected to be restored to normal performance levels within the next few hours.

Posted Nov 03, 2022 - 14:00 PDT

Update

We are beginning to see improvements in site performance. Our engineering team is continuing efforts to restore performance fully. Thank you for your continued patience throughout this process.

Posted Nov 03, 2022 - 12:40 PDT

Update

We are continuing to work on a fix for this issue.

Posted Nov 03, 2022 - 09:12 PDT

Identified

Our team has identified an issue on the system that may impact the overall performance across the DrChrono platform. This may become more apparent during peak hours of the day.

We are working towards getting this resolved and will post updates as they become available.

Posted Nov 03, 2022 - 05:44 PDT

This incident affected: drchrono.com and drchrono iPad EHR.