Issue Summary
Customers experienced delays with items such as lab orders and referrals showing up in the message center between 9 AM EST and 6:45 PM EST. Between 3:30 PM EST and 6:45 PM EST, customers also experienced intermittent 503 errors while attempting to log into the system as well as slow/sluggish behavior while using the web application. During the latter incident window, 5% of all requests resulted in 503 errors while a total of 22% of requests failed to complete successfully (some due to user canceling/refreshing).
How were customers impacted?
Customers experience delays in receiving items in the message center as well as general application slowness and at times, trouble logging in.
Root Cause
It was determined that a production hotfix deployed on Friday evening at 9 PM EST resulted in a flood of messages that overwhelmed our background processing capabilities for some activities in the application. As a result, our Amazon AWS-hosted queuing infrastructure tipped over under resource pressure from the increased load and took longer than expected to failover and recover. However, this flood was not experienced until hitting Monday traffic levels.
Resolution
The hotfix code was reverted and the message backlog was drained.
Mitigation steps planned/ taken