Between 05/14 and 05/16, customers received scattered errors on different screens throughout the application.
All times are PST
Date/Time | Activity |
---|---|
2022-05-14 06:40 | One of our virtual servers encounters a hardware failure on AWS and is shut down automatically. |
2022-05-14 06:53 | The affected instance is rebooted automatically and starts receiving traffic it cannot serve correctly. Some customers begin to be affected. |
2022-05-15 19:44 | Our engineering team is notified that users are having problems when they open any appointment. |
2022-05-16 04:45 | The support team raises the urgency of the issue due to the increase in customer traffic and support tickets received. |
2022-05-16 05:42 | A member of the DevOps team starts to investigate the issue. |
2022-05-16 05:52 | An incident is posted to the status page. |
2022-05-16 06:55 | The affected instance is removed from the pool and request error rates drop sharply. |
2022-05-16 07:07 | Status page is updated to monitoring. |
2022-05-16 09:58 | Status page is updated to resolved. |
AWS hardware failures are infrequent enough that our infrastructure is not as hardened against them as it should be. After the restart, the server should have started every service needed to serve production traffic, but only one of the two services started properly. This was enough to pass our health checks (so traffic was sent to the instance) but not enough to serve that traffic correctly.
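To illustrate the gap described above, here is a minimal sketch of a health check that reports healthy only when every required service responds. It is not our actual health check: the function name `aggregate_health`, the probe callables, and the service names `service_a` and `service_b` are hypothetical placeholders.

```python
from typing import Callable, Dict, Tuple

# Hypothetical probe type: a callable that returns True when a service is ready.
Probe = Callable[[], bool]


def aggregate_health(probes: Dict[str, Probe]) -> Tuple[int, Dict[str, bool]]:
    """Return (HTTP status, per-service results); 200 only if every probe passes."""
    results: Dict[str, bool] = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises is treated as an unhealthy service.
            results[name] = False
    # An empty probe set is treated as unhealthy rather than vacuously healthy.
    healthy = bool(results) and all(results.values())
    return (200 if healthy else 503), results


# Example: one of two services failed to start after a reboot.
status, detail = aggregate_health({
    "service_a": lambda: True,   # placeholder probe, assumed to be running
    "service_b": lambda: False,  # placeholder probe, assumed to have failed
})
# status is 503 here, so a load balancer would keep the instance out of rotation.
```

With a check of this shape, an instance that starts only one of its two services would fail the health check and would not receive production traffic.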
The DevOps team took the affected instance out of the webserver pool.
Customers were receiving scattered errors on the platform.
The following DevOps tasks were created: