Between February 4 and February 9, 2022, customers experienced intermittent disruption in their ability to send appointment reminders.
All times are PST.
|2022-02-04 09:39||Customer Success first reports that reminders are not being sent out.|
|2022-02-04 09:45||Engineering investigates issue.|
|2022-02-07 06:39||Customers report that reminders are not being sent out.|
|2022-02-07 06:46||Engineering investigates potential issue with cron job failing.|
|2022-02-07 06:50||DevOps increases memory limit for cron job.|
|2022-02-07 08:07||DevOps & Engineering monitor reminders queue.|
|2022-02-08 10:03||Customer Success reports that reminders are still not being sent out.|
|2022-02-08 10:04||DevOps & Engineering investigate issue.|
|2022-02-08 10:13||DevOps & Engineering investigate possible Twilio issues with foreign phone number API.|
|2022-02-08 11:38||DevOps & Engineering continue investigating.|
|2022-02-08 11:38||DevOps & Engineering's attention shifts to a larger database replication issue.|
|2022-02-09 06:24||Customer Success reports that reminders are still not being sent out.|
|2022-02-09 06:55||Engineering investigates additional theories for reminders issue.|
|2022-02-09 07:16||DevOps restarts Celery queues. Reminder logs show reminders are being sent out.|
|2022-02-09 07:32||Status Page created to reflect investigating status.|
|2022-02-09 07:36||Engineering continues to investigate possible Twilio issues.|
|2022-02-09 08:00||Engineering investigates additional cron job failures for reminders.|
|2022-02-09 08:42||Additional cron job memory increased.|
|2022-02-09 09:02||Additional cron job with increased memory exits immediately. DevOps investigates environment issue.|
|2022-02-09 09:09||DevOps & Engineering join a call to investigate possible issues together.|
|2022-02-09 09:10||DevOps manually runs process reminders job to process backlog.|
|2022-02-09 09:41||Failing cron job is migrated from RackSpace to an AWS server.|
|2022-02-09 09:55||Cron job executes successfully. Backlog of reminders is sent out.|
|2022-02-09 10:30||DevOps & Engineering verification and troubleshooting continues.|
|2022-02-09 10:31 - 13:00||Additional testing by Engineering, DevOps and Customer Success.|
|2022-02-09 11:36||Status Page updated to reflect continued investigating status.|
|2022-02-09 13:19||DevOps & Engineering verify solution has resolved the issue. Backlog of reminders has been processed.|
|2022-02-09 13:38||Status Page updated to monitoring status.|
|2022-02-09 14:34||Status Page updated to resolved status.|
Several factors contributed to the issues described above. The first major factor was technical debt: one of the older database replication servers was struggling to handle traffic and had to be removed from the pool, and additional replication servers had to be stood up. This set DevOps and the MariaDB team on the task of creating new replication servers and adding them to the load-balancing pool. A second factor was a misconfiguration of the new replication servers, which led to replicas being added to the pool without actively receiving traffic. These factors, along with other issues, consumed the engineering team's available bandwidth and prevented a proper investigation of the reminders issue. The reminders issue timeline coincided with the incident described in the previous incident report.
Migrating the failing cron job from the existing RackSpace server to AWS resolved the issue.
This issue affected customers' ability to send out reminders for appointments.
Migrating the failing cron job and all other critical cron jobs from the existing RackSpace server to AWS will prevent a similar issue from occurring. Additionally, proper notification channels and processes for failing critical jobs are actively being set up.
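As an illustration of what failure notifications for a critical job could look like, here is a minimal Python sketch of a wrapper that runs a job and reports any failure. It is not the tooling used in this incident: the function name `run_monitored`, the `notify` callback, and the fallback exit codes 124 and 127 are all assumptions chosen for the example. In practice `notify` would post to whichever alerting channel the team adopts.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def run_monitored(
    name: str,
    cmd: Sequence[str],
    notify: Callable[[str], None],
    timeout: Optional[float] = None,
) -> int:
    """Run a job; call notify() with a message if it fails, hangs, or cannot start.

    `name`, `notify`, and the 124/127 fallback codes are hypothetical choices
    for this sketch, not part of any existing tooling.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        notify(f"[{name}] timed out after {timeout}s")
        return 124
    except OSError as exc:
        # Covers a missing binary or a broken environment on the host.
        notify(f"[{name}] could not start: {exc}")
        return 127

    # A non-zero code covers ordinary errors; a negative code means the
    # process was killed by a signal (for example, when it runs out of memory).
    if result.returncode != 0:
        tail = result.stderr.strip()[-500:]
        notify(f"[{name}] exited with code {result.returncode}: {tail}")
    return result.returncode


def log_to_stderr(message: str) -> None:
    """Placeholder notifier; a real one would page or post to a channel."""
    print(message, file=sys.stderr)
```

A crontab entry would then invoke a small script that calls `run_monitored("process-reminders", [...], log_to_stderr)` instead of calling the job directly, so that a job that exits immediately or is killed produces an alert rather than failing silently.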