Site Slowness
Incident Report for DrChrono
Postmortem

General Site Slowness; Redis Node Issue

Summary

Site performance was briefly degraded by a Redis node failure.

Timeline (PST, 24-hour clock)

Date/Time Activity
2021-08-30 17:02:55 Support team started receiving tickets related to site performance.
2021-08-30 17:04:36 DevOps team identified elevated database activity as a potential cause.
2021-08-30 17:07:47 Databases began recovering from the increase in activity.
2021-08-30 17:12:17 An increase in Redis cache response times was noted and investigated further.
2021-08-30 17:14:22 Site recovered while the investigation into the Redis cache continued.
2021-08-30 17:23:29 Redis node failure identified.
2021-08-30 18:09:41 Potential issue identified with code interacting with the Redis cache.

Contributing Factor(s)

A Redis node failure led to a decrease in performance. Additionally, several places in the code were identified that execute potentially unnecessary queries against the Redis cache.
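
As an illustration only, one common form of unnecessary cache query is fetching the same key several times while handling a single request. The sketch below memoizes reads for the lifetime of one request so that each key costs at most one round trip. The names CountingCache and RequestScopedCache are hypothetical and do not come from the DrChrono codebase, and a dict-backed stand-in is used in place of a real Redis client.

```python
from typing import Dict, Optional


class CountingCache:
    """Hypothetical stand-in for a Redis client; counts round trips."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)
        self.calls = 0

    def get(self, key: str) -> Optional[str]:
        self.calls += 1  # each call models one network round trip
        return self._data.get(key)


class RequestScopedCache:
    """Remembers values (including misses) already fetched in one request."""

    def __init__(self, backend: CountingCache) -> None:
        self._backend = backend
        self._local: Dict[str, Optional[str]] = {}

    def get(self, key: str) -> Optional[str]:
        # Only go to the backend the first time a key is requested.
        if key not in self._local:
            self._local[key] = self._backend.get(key)
        return self._local[key]


backend = CountingCache({"user:1:name": "Ada"})
cache = RequestScopedCache(backend)

# Five reads of the same key within one request.
names = [cache.get("user:1:name") for _ in range(5)]
print(names[0], backend.calls)
```

A wrapper like this would be created per request and discarded afterwards, so it never serves stale values across requests. Whether this matches the code paths actually identified is not stated in this report.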

Stabilization Steps

The Redis node recovered on its own, and site performance improved.

Impact

Most of our customers would have experienced slowness caused by the increased database activity, because the database is slower than the Redis cache.
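
The report does not describe the actual read path. Assuming a conventional cache-aside pattern, the sketch below shows why a cache node failure shifts load onto the database: every read that cannot be served from the cache falls through to the slower store. All names here (CacheUnavailable, read_through, the sample key) are hypothetical.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache node cannot be reached."""


def read_through(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    db_get: Callable[[str], str],
    stats: Dict[str, int],
) -> str:
    """Try the cache first; fall back to the database on a miss or error."""
    try:
        value = cache_get(key)
    except CacheUnavailable:
        stats["cache_errors"] += 1
        value = None
    if value is None:
        stats["db_reads"] += 1  # the slower path
        value = db_get(key)
    return value


DB = {"item:7": "payload"}
CACHE = dict(DB)  # warm cache holding the same data


def healthy_cache(key: str) -> Optional[str]:
    return CACHE.get(key)


def failed_cache(key: str) -> Optional[str]:
    raise CacheUnavailable(key)


def database(key: str) -> str:
    return DB[key]


healthy = {"cache_errors": 0, "db_reads": 0}
degraded = {"cache_errors": 0, "db_reads": 0}
for _ in range(100):
    read_through("item:7", healthy_cache, database, healthy)
    read_through("item:7", failed_cache, database, degraded)

print(healthy["db_reads"], degraded["db_reads"])
```

With the cache healthy, none of the 100 reads reach the database. With the cache node down, all 100 do, which is the kind of load shift that would produce elevated database activity.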

Corrective Actions

Node recovery is a “self-healing” corrective action that occurred as expected. Additionally, the engineering team has an ongoing effort to identify and resolve areas in the codebase that could be problematic in a similar scenario. Finally, we will continue monitoring the infrastructure and will expand the system's fault tolerance as necessary.

Posted Sep 09, 2021 - 13:56 PDT

Resolved
This incident has been resolved.
Posted Aug 31, 2021 - 04:50 PDT
Monitoring
This issue has been resolved. We are seeing normal response times and will continue to monitor the system. Please reach out to our support team if you continue to run into any issues.
Posted Aug 30, 2021 - 14:20 PDT
Investigating
We are currently investigating reports of sitewide slowness and trouble saving data that appear to have begun at approximately 2 pm PST. We will provide an update with additional information as soon as possible.
Posted Aug 30, 2021 - 14:09 PDT
This incident affected: drchrono.com, drchrono iPad EHR, and DrChrono Telehealth Platform.