On September 23, 2021, changes to Doctor Settings, Account Settings (CRM), Staff Permissions, and Onpatient Settings could not be saved because the save requests timed out.
All times are EST.
Date/Time | Activity |
---|---|
2021-09-23 11:27:28 | Started receiving initial reports of the issue. |
2021-09-23 11:27:28 | DevOps team started investigating potential causes. |
2021-09-23 11:32:49 | Possible related errors identified in logs. |
2021-09-23 11:34:28 | Possible issue with Celery identified. Started monitoring Celery. |
2021-09-23 11:43:21 | Problematic behavior with Celery observed again. |
2021-09-23 11:49:01 | Continued testing and observing Celery. |
2021-09-23 11:55:32 | Engineering team identified a potential issue with the code. |
2021-09-23 12:03:30 | Account Management team identified an issue in Sentry that could also be related. |
2021-09-23 12:09:44 | Restarted the Celery queues on all of the servers. |
2021-09-23 12:26:33 | Engineering started implementing a code fix. |
2021-09-23 13:00:57 | Status Page created. |
2021-09-23 13:14:02 | Started deployment of the code fix to the staging environment. |
2021-09-23 13:22:55 | Started testing code fix in staging. |
2021-09-23 14:20:27 | Fixes deployed to staging verified and approved for deployment to production. |
2021-09-23 14:22:22 | Started deployment of the code fix to the production environment. |
2021-09-23 15:17:29 | Started testing code fix in production. |
2021-09-23 15:38:00 | Testing in production finished and fix verified. Status page updated. |
Updates to information under Account Settings previously took about 5 minutes to display, because all of this information was cached for 5 minutes regardless of any changes. A fix for this existing issue was implemented that cleared the cache after each account update. The frequent cache clearing overloaded Redis and caused the subsequent performance issues.
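As an illustration of why clearing a cache on every update can be costly, the sketch below compares clearing the entire cache with deleting only the entry for the updated account. This report does not state which of the two the original change did, nor what the hotfix changed, so this is a generic example under stated assumptions rather than the actual code. The `TTLCache` class, the `simulate` function, and the `account:<n>` key format are hypothetical, and an in-memory dictionary stands in for Redis.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with a fixed TTL (hypothetical)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}
        self.misses = 0  # each miss stands for one database read

    def get(self, key: str, loader: Callable[[str], dict]) -> dict:
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: entry exists and has not expired
        self.misses += 1  # cache miss: fall through to the "database"
        value = loader(key)
        self._store[key] = (now + self.ttl, value)
        return value

    def delete(self, key: str) -> None:
        """Invalidate a single entry."""
        self._store.pop(key, None)

    def clear(self) -> None:
        """Invalidate every entry."""
        self._store.clear()


def simulate(invalidate_all: bool, accounts: int = 100, updates: int = 10) -> int:
    """Return the number of simulated database reads caused by the updates."""
    db = {f"account:{i}": {"version": 0} for i in range(accounts)}
    cache = TTLCache(ttl_seconds=300)  # 5-minute TTL, as described above

    def load(key: str) -> dict:
        return dict(db[key])

    for key in db:  # warm the cache, then reset the counter
        cache.get(key, load)
    cache.misses = 0

    for n in range(updates):
        key = f"account:{n}"
        db[key]["version"] += 1  # an account update
        if invalidate_all:
            cache.clear()  # every account must be reloaded afterwards
        else:
            cache.delete(key)  # only the updated account is reloaded
        for k in db:  # all accounts are read again after the update
            cache.get(k, load)
    return cache.misses


if __name__ == "__main__":
    print("full clear:", simulate(invalidate_all=True))
    print("targeted delete:", simulate(invalidate_all=False))
```

With the default arguments, the full clear causes 1000 simulated database reads and the targeted delete causes 10, while both approaches show the updated value immediately. The numbers are only meant to show how the load scales with the number of cached entries, not to model the production system.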
To stabilize the platform, the underlying code issue was identified, and a hotfix was created, deployed, and validated.
Most of our customers would have experienced slowness, because requests normally served from the Redis cache went to the database instead, which is slower.
Engineering fixed the identified code issue, and the fix was deployed and verified.