On May 21st and May 24th, 2021, the EHR web servers experienced an increase in errors and queue times, impacting site and application performance.
|2021-05-21 13:22||Sporadic, nonspecific reports of general site slowness received|
|2021-05-21 13:41||Third-party CDN reports connectivity issues in some regions|
|2021-05-21 13:58||Slowness issues identified as likely caused by a queue backup|
|2021-05-21 14:12||Queue backup resolved|
|2021-05-21 14:15||Issue appears to be resolved; Customer Escalation unable to reproduce previous issues|
|2021-05-21 15:35||Error rates are at normal levels|
|2021-05-21 16:35||No new tickets or reports of site slowness|
|2021-05-21 16:36||Site issues resolved; status page updated|
|2021-05-24 11:10||First reports of an increase in tickets for site slowness.|
|2021-05-24 11:12||Ops begins investigating possible issues.|
|2021-05-24 11:20||Queue times identified as being higher than normal.|
|2021-05-24 11:30||High load on the primary databases is identified; a running report/import is suspected of being the issue.|
|2021-05-24 11:45||Users are still experiencing site slowness; queue times are still high.|
|2021-05-24 11:47||Status message posted for “Site Slowness”.|
|2021-05-24 11:52||Import jobs have increased load on work queues but are processing. The increased load on the work queues is thought to be contributing to perceived slowness for some operations.|
|2021-05-24 12:10||Causes for increased load against the primary databases are identified and remediation begins.|
|2021-05-24 13:30||Remediation for load on the database is complete, but site slowness persists.|
|2021-05-24 13:35||A separate cause for an increase in errors for the web servers is identified.|
|2021-05-24 13:47||Status page updated to “Identified”.|
|2021-05-24 14:00||The issue causing the increase in errors is traced to a code commit from 2021-05-20.|
|2021-05-24 14:10||A hotfix to mitigate the issue is prepared.|
|2021-05-24 14:30||Hotfix deployment begins.|
|2021-05-24 15:15||Deployment is complete; web queue times are back at baseline, errors have ceased.|
|2021-05-24 15:18||Status page updated to “Operational”.|
|2021-05-24 16:07||Incident declared resolved, status page updated.|
A code change made during a planned production deployment on May 20th, 2021 led to an increase in errors during peak traffic hours on May 24th, 2021. A corresponding increase in database load was incorrectly identified as a contributing factor, leading to increased time to resolution.
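To put the "increased time to resolution" in concrete terms, the May 24th timeline entries can be differenced directly. The following is a minimal sketch, not part of the original report: the dictionary keys and the `minutes_between` helper are hypothetical labels chosen for illustration, and the timestamps are copied from the timeline above. Note that the database remediation window is only an upper bound on the delay it introduced, since the web server investigation may have overlapped with it.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the May 24th timeline; key names are illustrative only.
timeline = {
    "first_reports": "2021-05-24 11:10",
    "db_remediation_start": "2021-05-24 12:10",
    "db_remediation_done": "2021-05-24 13:30",
    "web_error_cause_identified": "2021-05-24 13:35",
    "hotfix_deployed": "2021-05-24 15:15",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return whole minutes elapsed between two timeline entries."""
    start = datetime.strptime(timeline[start_key], FMT)
    end = datetime.strptime(timeline[end_key], FMT)
    return int((end - start).total_seconds() // 60)


# First reports until the hotfix deployment completed.
total_minutes = minutes_between("first_reports", "hotfix_deployed")

# Time spent remediating database load, which did not resolve the slowness.
db_minutes = minutes_between("db_remediation_start", "db_remediation_done")

# Time from identifying the web server error cause to the completed deployment.
fix_minutes = minutes_between("web_error_cause_identified", "hotfix_deployed")

print(f"total: {total_minutes} min, db remediation: {db_minutes} min, "
      f"identify-to-fix: {fix_minutes} min")
```

Under these assumptions, roughly a third of the time between the first reports and the completed hotfix deployment was spent on the database remediation.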
Most users would have experienced increased site slowness and errors while using the EHR application.
Once the offending commit was identified, a hotfix was created and deployed to our production environment to mitigate the errors.