OnPatient portal intermittent outage

Incident Report for DrChrono

Postmortem

OnPatient Outage

Summary

On 5/12/2021, DrChrono experienced issues related to our OnPatient product. This started with degraded performance and intermittent availability and escalated to a complete outage.

The underlying issue was identified and corrected. We also added capacity to help mitigate the impact of similar issues in the future.

Timeline (EST, 24-hour clock)

Date/Time	Activity
2021-05-12 11:00	On-call engineer paged for OnPatient site outage, and ops team begins working the issue.
2021-05-12 11:10	Supporting middleware and services verified as working.
2021-05-12 11:12	Some underlying web services were restarted. This corrected the errors but OnPatient was still running slowly.
2021-05-12 11:22	Errors noted as returning.
2021-05-12 12:11	Status page for incident created.
2021-05-12 13:22	Potential issue with a specific database query was identified, and database service was restarted.
2021-05-12 14:10	An issue with the table structure was identified and a manual process to correct the structure was started.
2021-05-12 14:21	OnPatient portal brought down to decrease load on the database server while the correction was running.
2021-05-12 15:13	Status page updated to “Identified”.
2021-05-12 15:42	DrChrono team is unable to access our hardware vendor's portal, due to a technical issue on the vendor's side, to request assistance with identifying any underlying issues.
2021-05-12 16:44	Third-party vendor access was restored and support tickets were opened by the DrChrono team.
2021-05-12 17:30	Plans to resize the database server were put in place and a fresh backup of the data was taken.
2021-05-12 19:22	Backup of DB finished and verified. Resize of web and database servers initiated.
2021-05-12 19:37	Hardware vendor confirms underlying disk performance issues.
2021-05-12 19:52	Resize of Web instance completed, instance restored.

2021-05-13 00:25	DB instance resized and brought back up.
2021-05-13 00:25	Web services restored; OnPatient back up, end of incident.
2021-05-13 00:31	Status page updated to “Resolved”.

Contributing Factors

The underlying hardware from our vendor that powers our OnPatient database server started a disk consistency check at about 10:55 am EST. This consistency check, and/or other issues associated with the degraded disk performance, affected the OnPatient database server.

A missing index contributed to slow queries against the OnPatient database. With sufficient disk I/O, these queries had previously performed acceptably. However, the disk consistency check degraded I/O performance enough to expose the queries that were affected by the missing index.
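As a rough illustration of why a missing index matters, the sketch below compares the query plan for the same lookup before and after an index is added. It uses SQLite from the Python standard library purely for convenience; this report does not state which database engine OnPatient runs on, and the `appointments` table, its columns, and the index name are hypothetical, not the actual affected schema.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)

# Hypothetical table standing in for the affected one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appointments ("
    "id INTEGER PRIMARY KEY, patient_id INTEGER, scheduled_at TEXT)"
)
conn.executemany(
    "INSERT INTO appointments (patient_id, scheduled_at) VALUES (?, ?)",
    [(i % 1000, "2021-05-12") for i in range(10000)],
)

lookup = "SELECT id FROM appointments WHERE patient_id = ?"

# Without an index, the planner has to read every row of the table.
plan_without_index = query_plan(conn, lookup, (42,))

# With an index on the filtered column, it can seek directly to matching rows.
conn.execute(
    "CREATE INDEX idx_appointments_patient_id ON appointments (patient_id)"
)
plan_with_index = query_plan(conn, lookup, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The first plan is a full table scan and the second is an index search. A full scan can go unnoticed while disk reads are fast, which is consistent with the behavior described above: the queries only became a visible problem once disk I/O degraded.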

Impact

All customers experienced a complete outage of the OnPatient platform, including telehealth.

Corrective Actions

We increased the instance size for the OnPatient DB so that enough memory is available to better withstand future disk I/O issues. The resize also migrated the DB instance off the degraded hardware, which resolved the underlying disk I/O issues.

Engineering will ensure that the proper table structure is created for the affected table.

Posted May 26, 2021 - 13:38 PDT

Resolved

This incident has been resolved.

Posted May 12, 2021 - 17:31 PDT

Update

Thank you for your continued patience as our team works to restore access to the OnPatient portal. At this time, we are predicting service to be restored by 10:00 PM PST. We will continue to communicate updates here.

Posted May 12, 2021 - 14:30 PDT

Identified

The issue has been identified but we do not currently have an ETA on completion of the fix. We will post another update as to a timeframe as soon as we can.

Posted May 12, 2021 - 12:13 PDT

Investigating

We are currently investigating an issue that is intermittently preventing login to the OnPatient portal. This issue seems to have started at approximately 8:00 PST. We will provide additional information as soon as possible.

Posted May 12, 2021 - 09:09 PDT

This incident affected: onpatient.com and onpatient iPhone PHR.