System performance and login issues
Incident Report for DrChrono
Postmortem

Issues: inability to log onto DrChrono, gateway errors, and data not saving

Summary

On February 07, 2022, issues occurred that affected customers and practice groups in the following ways: inability to log onto DrChrono, gateway errors, and data not saving. Related slowness and access issues began on February 03, 2022, as detailed in the timeline below.

Timeline (PST, 24-hour clock)

Date/Time Activity
2022-02-03 08:56 Initial inflow of tickets reporting slowness on DrChrono is received.
2022-02-03 08:59 DevOps investigates and determines replication lag occurred (see the lag-check sketch below the timeline).
2022-02-03 09:10 Bad gateway errors are reported.
2022-02-03 09:15 DevOps investigates and determines replication lag is occurring in short spikes.
2022-02-03 09:54 Additional practice groups are reporting inability to access DrChrono.
2022-02-03 09:58 CloudFlare posts incident status for issues in Newark, New Jersey region.
2022-02-03 10:08 DevOps clears space on Celery servers to help relieve additional load.
2022-02-03 10:53 Customer Success reports practice group access issues have been resolved.
2022-02-04 07:15 Customer Success reports that tasks are not loading and that customers are receiving bad gateway errors.
2022-02-04 07:19 Customer Success reports influx of customer tickets.
2022-02-04 07:31 Engineering helps troubleshoot the issue.
2022-02-04 07:48 Practice groups are reporting bad gateway errors.
2022-02-04 07:54 Engineering and DevOps investigate.
2022-02-04 08:11 Influx of customer tickets reporting slowness and unresponsiveness increases.
2022-02-04 08:21 Engineering identifies a significant spike in tasks.
2022-02-04 08:21 Engineering investigates the issues.
2022-02-04 09:09 Engineering continues to investigate the spikes in tasks.
2022-02-04 09:31 Engineering investigates increased average query times.
2022-02-04 09:54 Engineering and DevOps start a call to investigate possible issues collectively.
2022-02-04 10:00 An issue with a new replica db server is identified. DevOps initiates remediation with the MariaDB service team.
2022-02-04 10:10 Faulty replica db is removed from the active load pool.
2022-02-04 10:10 Engineering, DevOps and Customer Success monitor the fix.
2022-02-04 12:55 Customer Success reports no additional tickets for the issue.
2022-02-04 12:57 DevOps removes an additional replica db server from the load balancing pool. DevOps creates a plan to build new replicas over the upcoming weekend and add them to the active load balancing pool.
2022-02-04 13:55 A practice group reports issues with logging onto DrChrono.
2022-02-04 14:55 Engineering and DevOps investigate access issues.
2022-02-04 18:05 Issue with the practice group is resolved.
2022-02-05 DevOps works with the MariaDB team to create new replica database servers and incrementally increase traffic load.
2022-02-06 DevOps works with the MariaDB team to create new replica database servers and incrementally increase traffic load.
2022-02-07 06:39 Influx of tickets from customers reporting issues accessing DrChrono.
2022-02-07 06:40 DevOps investigates reports.
2022-02-07 06:45 Large health providers are reported to be unable to access DrChrono.
2022-02-07 06:50 DevOps continues to investigate.
2022-02-07 07:10 DevOps increases traffic load on database replicas to balance out traffic. A separate replica database server experiencing high I/O wait is removed from the pool.
2022-02-07 07:16 Customer Success reports AJAX errors are affecting customers.
2022-02-07 07:33 Status Page created to reflect investigating status.
2022-02-07 07:49 DevOps reintroduces previously removed database replica server.
2022-02-07 07:50 Engineering and DevOps investigate the increasing AJAX error rate.
2022-02-07 08:39 Status Page updated to reflect continued investigating status.
2022-02-07 08:50 Engineering and DevOps start a call and investigate the issues.
2022-02-07 08:50-11:48 Engineering and DevOps continue to investigate possible causes: large inefficient queries, overloaded cron jobs, and database load issues.
2022-02-07 11:29 Status Page updated to reflect continued investigating status.
2022-02-07 11:49 An increase in tickets reporting site slowness is received.
2022-02-07 11:48-13:13 Engineering and DevOps continue to investigate possible causes: large inefficient queries, overloaded cron jobs, and database load issues.
2022-02-07 13:14 The underlying issues are identified and a fix is applied. The fix is being actively monitored.
2022-02-07 13:20 Status Page updated to monitoring status.
2022-02-08 06:37 Status Page updated to resolved status.
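
The replication lag noted at the start of the timeline is the kind of symptom that is typically read from the replica itself. The sketch below is illustrative only and does not reflect DrChrono's actual tooling: it assumes a Python client (PyMySQL), hypothetical host names and credentials, and an arbitrary alert threshold, and simply reports MariaDB's Seconds_Behind_Master value for each replica.

```python
# Illustrative replication-lag check for MariaDB replicas. Host names,
# credentials, and the threshold are assumptions for this sketch, not
# DrChrono's actual configuration.
import pymysql

REPLICAS = ["replica-1.internal", "replica-2.internal"]  # hypothetical hosts
LAG_THRESHOLD_SECONDS = 30  # hypothetical alert threshold


def replication_lag(host):
    """Return Seconds_Behind_Master for one replica (None if replication is stopped)."""
    conn = pymysql.connect(host=host, user="monitor", password="example",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()


for host in REPLICAS:
    lag = replication_lag(host)
    if lag is None or lag > LAG_THRESHOLD_SECONDS:
        print(f"{host}: lag problem (Seconds_Behind_Master={lag})")
    else:
        print(f"{host}: healthy ({lag}s behind)")
```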

Contributing Factor(s)

There were two contributing factors behind the issues described above. The first was technical debt: one of the older database replication servers was struggling to handle traffic, so it needed to be removed from the pool and additional replication servers needed to be stood up. This set DevOps and the MariaDB team on the task of creating new replication servers and introducing them into the load balancing pool. The second factor was a misconfiguration of those new replication servers: they were introduced into the pool but were not actively receiving traffic. In effect, DrChrono was running at maximum traffic capacity but was unable to handle the traffic efficiently.
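
To make the second factor concrete, the following minimal sketch models a weighted pool in plain Python; the replica names and weights are invented and this is not DrChrono's actual load balancer. It shows how a replica that is listed in the pool but given no effective weight receives zero traffic, so overall capacity does not actually grow.

```python
import random

# Hypothetical pool: names and weights are invented for illustration.
# A weight of 0 models the misconfiguration described above: the new
# replicas are "in the pool" but never selected, so they carry no traffic.
pool = {
    "replica-old-1": 100,  # older replica, still carrying all the load
    "replica-new-1": 0,    # new replica, misconfigured
    "replica-new-2": 0,    # new replica, misconfigured
}


def pick_replica(pool):
    """Choose a replica for one query, proportionally to its weight."""
    names, weights = zip(*pool.items())
    return random.choices(names, weights=weights, k=1)[0]


# Send 10,000 simulated queries through the pool and count where they land.
counts = dict.fromkeys(pool, 0)
for _ in range(10_000):
    counts[pick_replica(pool)] += 1
print(counts)  # every query lands on replica-old-1
```

Correcting the weights, or the equivalent setting in whatever balancer is actually in use, is what the stabilization step below refers to.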

Stabilization Steps

Once the new replication servers were identified as not receiving traffic, reconfiguring them into the load balancing pool balanced out the business-hour traffic load.
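
One way to confirm that a reconfigured replica is genuinely taking traffic, rather than merely being listed in the pool, is to watch a server-side counter move between two samples. The sketch below is illustrative and assumes PyMySQL plus a hypothetical host and credentials; it samples MariaDB's cumulative Questions status variable ten seconds apart.

```python
import time
import pymysql


def questions_count(host):
    """Read the cumulative Questions counter (statements executed) from one server."""
    conn = pymysql.connect(host=host, user="monitor", password="example",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
            return int(cur.fetchone()["Value"])
    finally:
        conn.close()


# Hypothetical host. Under business-hour load, a near-zero delta suggests the
# replica is in the pool but not actually receiving traffic.
host = "replica-new-1.internal"
before = questions_count(host)
time.sleep(10)
after = questions_count(host)
print(f"{host}: {after - before} statements executed in 10 seconds")
```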

Impact

This issue affected customers' ability to use and navigate the website: customers received bad gateway errors when traversing the site, and data (calendar entries, reminders, etc.) did not save.

Corrective Actions

Reconfiguring the new replication servers allowed the site to stabilize by balancing out the traffic load. Documenting the exact process for any database configuration change, along with mitigation and fallback procedures, will help prevent a similar situation from happening again.

Posted Feb 11, 2022 - 14:20 PST

Resolved
We apologize for the system performance issues that we were experiencing yesterday. We have monitored the system since applying a fix and the platform performance has normalized. We will have a Root Cause Analysis available by week's end here on the status page. Please note that reminders scheduled to be sent today and onward should be delivered.
Posted Feb 08, 2022 - 06:37 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 07, 2022 - 13:20 PST
Update
We are continuing to investigate the performance issues being experienced today in the DrChrono platform. Thank you for your patience. We apologize for the inconveniences caused.
Posted Feb 07, 2022 - 11:29 PST
Update
We are continuing to investigate this issue.
Posted Feb 07, 2022 - 08:39 PST
Investigating
We are currently investigating customer reports of the inability to log in to DrChrono, data not saving, bad gateway errors, and reminders not being delivered. We will provide another update here as soon as possible. Thank you for your patience.
Posted Feb 07, 2022 - 07:33 PST
This incident affected: drchrono.com, drchrono iPad EHR, and drchrono iPad Check-In Kiosk Application.