Unable to update Account Settings, OnPatient Settings, Staff Permissions
Incident Report for DrChrono
Postmortem

Settings Update Issues

Summary

On September 23, 2021, attempts to save Doctor Settings, Account Settings (CRM), Staff Permissions, or OnPatient Settings failed because the requests timed out.

Timeline

All times are US Eastern Time on a 24-hour clock.

Date/Time Activity
2021-09-23 11:27:28 Started receiving initial reports of the issue.
2021-09-23 11:27:28 DevOps team started investigating potential causes.
2021-09-23 11:32:49 Possibly related errors identified in logs.
2021-09-23 11:34:28 Possible issue with Celery identified. Started monitoring Celery.
2021-09-23 11:43:21 Problematic behavior with Celery observed again.
2021-09-23 11:49:01 Continued testing and observing Celery.
2021-09-23 11:55:32 Engineering team identified a potential issue with the code.
2021-09-23 12:03:30 Account Management team identified an issue in Sentry that could also be related.
2021-09-23 12:09:44 Restarted the Celery queues on all of the servers.
2021-09-23 12:26:33 Engineering started implementing a code fix.
2021-09-23 13:00:57 Status Page created.
2021-09-23 13:14:02 Started deployment of the code fix to the staging environment.
2021-09-23 13:22:55 Started testing code fix in staging.
2021-09-23 14:20:27 Fixes deployed to staging verified and approved for deployment to production.
2021-09-23 14:22:22 Started deployment of the code fix to the production environment.
2021-09-23 15:17:29 Started testing code fix in production.
2021-09-23 15:38:00 Testing in production finished and fix verified. Status page updated.
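
The elapsed times implied by the timeline above can be worked out with a short script. This is only a sketch: the timestamps are copied from the timeline, while the dictionary keys and the `elapsed` helper are illustrative names that do not appear in the original report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline; the key names are illustrative labels.
events = {
    "first_report": "2021-09-23 11:27:28",
    "status_page_created": "2021-09-23 13:00:57",
    "prod_deploy_started": "2021-09-23 14:22:22",
    "fix_verified": "2021-09-23 15:38:00",
}


def elapsed(start, end):
    """Return the time between two named timeline events as a timedelta."""
    return datetime.strptime(events[end], FMT) - datetime.strptime(events[start], FMT)


print(elapsed("first_report", "fix_verified"))         # 4:10:32
print(elapsed("first_report", "status_page_created"))  # 1:33:29
print(elapsed("prod_deploy_started", "fix_verified"))  # 1:15:38
```

By this calculation, roughly 4 hours and 10 minutes passed between the first reports and the verified fix in production.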

Contributing Factor(s)

Before the incident, changes made under Account Settings took about 5 minutes to appear, because all of this information was cached for 5 minutes regardless of any changes. A fix for this existing issue was implemented that cleared the cache after each account update. This overloaded Redis and caused the subsequent performance issues.
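
The effect described above can be illustrated with a small simulation. This is a hypothetical sketch, not DrChrono's code: it uses an in-memory stand-in instead of Redis, and it assumes that the problematic change cleared far more cached data than the one account being updated. The report does not state the scope of the clear or what the hotfix changed, so `TTLCache`, `SettingsService`, `simulate`, and the account and update counts are all invented for illustration.

```python
import time


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def delete(self, key):
        self._store.pop(key, None)

    def clear(self):
        self._store.clear()


class SettingsService:
    """Hypothetical read-through cache in front of a (simulated) database."""

    def __init__(self, cache, invalidate_all):
        self.cache = cache
        self.invalidate_all = invalidate_all  # assumed behavior switch
        self.db = {}
        self.db_reads = 0

    def _key(self, account_id):
        return f"settings:{account_id}"

    def read(self, account_id):
        key = self._key(account_id)
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        self.db_reads += 1  # cache miss: fall back to the slower database
        value = self.db.get(account_id, {})
        self.cache.set(key, value)
        return value

    def update(self, account_id, settings):
        self.db[account_id] = dict(settings)
        if self.invalidate_all:
            self.cache.clear()  # every account's entry is dropped
        else:
            self.cache.delete(self._key(account_id))  # only this account


def simulate(invalidate_all, accounts=100, updates=10):
    """Count database reads for a warm cache followed by a few updates."""
    svc = SettingsService(TTLCache(ttl_seconds=300), invalidate_all)
    for a in range(accounts):
        svc.db[a] = {"theme": "light"}
    for a in range(accounts):
        svc.read(a)  # warm the cache
    for u in range(updates):
        svc.update(u, {"theme": "dark"})
        for a in range(accounts):
            svc.read(a)  # every account is read again after each update
    return svc.db_reads


print(simulate(invalidate_all=True))   # 1100
print(simulate(invalidate_all=False))  # 110
```

Under these assumed numbers, a broad clear on every update causes ten times as many database reads as deleting only the updated account's key, while both variants still show the new value immediately after an update.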

Stabilization Steps

To stabilize the platform, Engineering identified the code issue and then created, deployed, and validated a hotfix.

Impact

Most of our customers would have experienced slowness due to increased database activity, since reading from the database is slower than reading from the Redis cache.

Corrective Actions

Engineering fixed the identified code issue, and the fix was deployed and verified.

Posted Sep 29, 2021 - 12:56 PDT

Resolved
This incident has been resolved.
Posted Sep 23, 2021 - 12:37 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 23, 2021 - 12:18 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 23, 2021 - 11:17 PDT
Investigating
We are currently investigating an issue preventing users from updating Account Settings, OnPatient Settings, and Staff Permissions. We will provide an update here with more information as soon as possible.
Posted Sep 23, 2021 - 10:00 PDT
This incident affected: drchrono.com.