System Outage
Incident Report for DrChrono
Postmortem

RCA 20210216

Description

Primary database server became unavailable, causing a site-wide outage for 12 minutes. Site was restored after a server reboot.

Timeline

All times are EST

Date/Time Activity
2021-02-16 18:47 Issues reported with site unavailability.
2021-02-16 18:51 Issue identified as an increased load on the primary database.
2021-02-16 18:53 A clean stop of database processes was attempted.
2021-02-16 18:55 Database process was unable to stop cleanly, and server was rebooted.
2021-02-16 18:58 Crash recovery process started.
2021-02-16 18:59 Site access was restored. Status was updated to “monitoring”.
2021-02-16 19:02 Celery workers restarted.
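
As a cross-check on the reported outage duration, the timeline entries can be subtracted directly. The following is a minimal sketch, not part of the original report; the variable and function names are illustrative assumptions, and the timestamps are copied from the timeline above (all EST).

```python
from datetime import datetime

# Timestamps taken from the timeline above (all EST).
FMT = "%Y-%m-%d %H:%M"
first_report = datetime.strptime("2021-02-16 18:47", FMT)
site_restored = datetime.strptime("2021-02-16 18:59", FMT)
workers_restarted = datetime.strptime("2021-02-16 19:02", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Hypothetical names: time until site access returned, and time until
# the last recorded recovery step (Celery worker restart).
outage_minutes = minutes_between(first_report, site_restored)
full_recovery_minutes = minutes_between(first_report, workers_restarted)

print(f"Site unavailable: {outage_minutes} min")
print(f"Until workers restarted: {full_recovery_minutes} min")
```

Measured from the first report to restored site access, this agrees with the 12-minute figure in the Description and Impact sections.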

Contributing Factor(s)

Primary database became unresponsive under increased load and couldn’t serve requests.

Stabilization Steps

Primary database server was rebooted after the database process could not be stopped cleanly. Crash recovery then ran and site access was restored.

Impact

The site was unavailable for 12 minutes.

Corrective Actions

Operating system version was downgraded. An additional server was added to allow for clustering, which limits the impact of any future database failure.

Posted Feb 25, 2021 - 12:00 PST

Resolved
At 6:50pm ET we experienced an infrastructure issue, causing drchrono.com to be unresponsive for approximately 10 minutes. Our engineers have identified the issue and resolved it. Please let our support team know if you're still experiencing any issues with your site.
Posted Feb 16, 2021 - 16:00 PST