On 5/12/2021, DrChrono experienced issues related to our OnPatient product. This started with degraded performance and intermittent availability and escalated to a complete outage.
The underlying issue was identified and corrected. We also added capacity to help mitigate the impact of similar issues in the future.
Date/Time | Activity |
---|---|
2021-05-12 11:00 | On-call engineer paged for OnPatient site outage, and ops team begins working the issue. |
2021-05-12 11:10 | Supporting middleware and services verified as working. |
2021-05-12 11:12 | Some underlying web services were restarted. This corrected the errors but OnPatient was still running slowly. |
2021-05-12 11:22 | Errors noted as returning. |
2021-05-12 12:11 | Status page for incident created. |
2021-05-12 13:22 | Potential issue with a specific database query was identified, and database service was restarted. |
2021-05-12 14:10 | An issue with the table structure was identified and a manual process to correct the structure was started. |
2021-05-12 14:21 | OnPatient portal brought down to decrease load on the database server while the correction was running. |
2021-05-12 15:13 | Status page updated to “Identified”. |
2021-05-12 15:42 | DrChrono team is unable to access our hardware vendor's portal, due to a technical issue on the vendor's side, to request help identifying any underlying issues. |
2021-05-12 16:44 | Third-party vendor access was restored and support tickets were opened by the DrChrono team. |
2021-05-12 17:30 | Plans to resize the database server were put in place and a fresh backup of the data was started. |
2021-05-12 19:22 | Backup of DB finished and verified. Resize of web and database servers initiated. |
2021-05-12 19:37 | Hardware vendor confirms underlying disk performance issues. |
2021-05-12 19:52 | Resize of Web instance completed, instance restored. |
2021-05-13 00:25 | DB instance resized and brought back up. |
2021-05-13 00:25 | Web services restored; OnPatient back up, end of incident. |
2021-05-13 00:31 | Status page updated to “Resolved”. |
The underlying hardware from our vendor that powers our OnPatient database server started a disk consistency check at about 10:55 am EST. This consistency check, and/or other issues associated with the degraded disk performance, affected the OnPatient database server.
A missing index contributed to slow queries against the OnPatient database. These queries had previously performed acceptably because disk I/O was sufficient. However, the disk consistency check degraded I/O performance enough to expose the queries that relied on the table with the missing index.
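To illustrate how a missing index changes the way a query is executed, the following is a minimal sketch using an in-memory SQLite database. It is not the OnPatient schema or database engine: the `appointments` table, its columns, and the index name `idx_appointments_patient_id` are hypothetical and chosen only for illustration. Without the index, the planner has to scan every row, which is the kind of work that becomes much slower when disk I/O is degraded. With the index, it can seek directly to the matching rows.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Hypothetical table standing in for an application table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appointments ("
    "id INTEGER PRIMARY KEY, patient_id INTEGER, scheduled_at TEXT)"
)
conn.executemany(
    "INSERT INTO appointments (patient_id, scheduled_at) VALUES (?, ?)",
    [(i % 1000, "2021-05-12") for i in range(10000)],
)

lookup = "SELECT id FROM appointments WHERE patient_id = ?"

# Before the index exists, the planner must read the whole table.
plan_without_index = query_plan(conn, lookup, (42,))

# After adding the (hypothetical) index, the planner can seek on patient_id.
conn.execute(
    "CREATE INDEX idx_appointments_patient_id ON appointments (patient_id)"
)
plan_with_index = query_plan(conn, lookup, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a full scan of the table and the second reports a search using the index. Other database engines expose the same information through their own `EXPLAIN` output.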
All customers experienced a complete outage of the OnPatient platform, including telehealth.
Increased instance size for the OnPatient DB so that sufficient memory will be available to better survive future disk I/O issues. This resizing effort migrated the DB instance off the degraded hardware, which resolved the underlying disk I/O issues.
Engineering will ensure that the proper table structure is created for the affected table.