Reminders not being sent

Incident Report for DrChrono

Postmortem

Customers experienced intermittent issues sending appointment reminders.

Summary

Between February 4 and February 9, 2022, customers experienced intermittent disruption in their ability to send reminders for appointments.

Timeline (PST, 24-hour clock)

All times are PST.

Date/Time	Activity
2022-02-04 09:39	Customer Success first reports that reminders are not being sent out.
2022-02-04 09:45	Engineering investigates issue.
2022-02-07 06:39	Customers report reminders were not being sent out.
2022-02-07 06:46	Engineering investigates potential issue with cron job failing.
2022-02-07 06:50	DevOps increases memory limit for cron job.
2022-02-07 08:07	DevOps & Engineering monitor reminders queue.
2022-02-08 10:03	Customer Success reports reminders are still not being sent out.
2022-02-08 10:04	DevOps & Engineering investigate issue.
2022-02-08 10:13	DevOps & Engineering investigate possible Twilio issues with foreign phone number API.
2022-02-08 11:38	DevOps & Engineering continue investigating.
2022-02-08 11:38	DevOps & Engineering's attention shifts to a larger database replication issue.
2022-02-09 06:24	Customer Success reports reminders are still not being sent out.
2022-02-09 06:55	Engineering investigates additional theories for reminders issue.
2022-02-09 07:16	DevOps restarts celery queues. Reminder logs show reminders are being sent out.
2022-02-09 07:32	Status Page created to reflect investigating status.
2022-02-09 07:36	Engineering continues to investigate possible Twilio issues.
2022-02-09 08:00	Engineering investigates additional cron job failures for reminders.
2022-02-09 08:42	Additional cron job memory increased.
2022-02-09 09:02	Additional cron job with increased memory immediately exits. DevOps investigates environment issue.
2022-02-09 09:09	DevOps & Engineering join a call to investigate possible issues together.
2022-02-09 09:10	DevOps manually runs process reminders job to process backlog.
2022-02-09 09:41	Failing cron job is migrated from RackSpace to AWS server.
2022-02-09 09:55	Cron job executes successfully. Backlog of reminders is sent out.
2022-02-09 10:30	DevOps & Engineering verification and troubleshooting continues.
2022-02-09 10:31 - 13:00	Additional testing from Engineering, DevOps and Customer Success.
2022-02-09 11:36	Status Page updated to reflect continued investigating status.
2022-02-09 13:19	DevOps & Engineering verify solution has resolved the issue. Backlog of reminders has been processed.
2022-02-09 13:38	Status Page updated to monitoring status.
2022-02-09 14:34	Status Page updated to resolved status.

Contributing Factor(s)

Several factors contributed to the issues described above. The first major factor was technical debt: one of the older database replication servers was struggling to handle traffic and had to be removed from the pool, and additional replication servers had to be stood up. This set DevOps and the MariaDB team on the task of creating new replication servers and adding them to the load-balancing pool. A second factor was a misconfiguration of the new replication servers, which caused replicas to be added to the pool without actively receiving traffic. These factors, along with other issues, consumed the engineering team's available bandwidth and kept it from properly investigating the reminders issue. The reminders issue timeline coincided with that of the previous incident report.

Stabilization Steps

Migrating the failing cron job from the existing RackSpace server to AWS resolved the issue.

Impact

This issue affected customers' ability to send out reminders for appointments.

Corrective Actions

Migrating the failing cron job and all other critical cron jobs from RackSpace to AWS will prevent a similar issue from occurring. Additionally, proper notification channels and processes for failing critical jobs are actively being set up.
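
As an illustration of what failure notifications for a critical job could look like, here is a minimal Python sketch. It is not DrChrono's implementation. The function name run_critical_job, the notify callback, and the timeout default are all hypothetical. It wraps a job command and sends an alert when the command cannot start, times out, or exits with a non-zero status.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_critical_job(
    command: Sequence[str],
    notify: Callable[[str], None],
    timeout_seconds: float = 3600.0,  # hypothetical default, tune per job
) -> bool:
    """Run a job command and call notify() with a short message if it fails.

    Returns True when the command exits with status 0, False otherwise.
    """
    job_name = " ".join(command)
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        notify(f"Job timed out after {timeout_seconds}s: {job_name}")
        return False
    except OSError as exc:
        # The command could not be started at all (for example, a missing binary).
        notify(f"Job failed to start: {job_name}: {exc}")
        return False

    if result.returncode != 0:
        # Include the tail of stderr so the alert carries some context.
        stderr_tail = result.stderr.strip()[-500:]
        notify(
            f"Job exited with status {result.returncode}: {job_name}\n{stderr_tail}"
        )
        return False
    return True


if __name__ == "__main__":
    # Example: a deliberately failing command; the alert is printed to stderr.
    run_critical_job(
        [sys.executable, "-c", "raise SystemExit(1)"],
        notify=lambda message: print(message, file=sys.stderr),
    )
```

In practice the notify callback would post to whatever paging or chat channel the team monitors. Passing it in as a parameter keeps the wrapper independent of any particular alerting service.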

Posted Feb 15, 2022 - 10:31 PST

Resolved

Thank you for your patience while we worked to investigate and resolve this issue. SMS, phone, and email reminders were all affected. All reminder types have resumed sending as of approximately 12:00 PST today.

Posted Feb 09, 2022 - 14:34 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 09, 2022 - 13:38 PST

Update

We are continuing to investigate this issue.

Posted Feb 09, 2022 - 11:36 PST

Investigating

We are currently investigating reports of text, phone, and email reminders not being sent from the DrChrono platform. This appears to be an ongoing issue that has continued since the site performance incident that was resolved yesterday morning, February 8th. We will provide another update here as soon as we have additional information. Thank you for your patience.

Posted Feb 09, 2022 - 07:32 PST

This incident affected: drchrono.com.