Starting Friday, 3/4/22, we saw a slowdown in 835 and 837 file processing. On Tuesday, 3/8/22, the RCM team also reported that ERA verifications were behind.
All times are PST
Date/Time | Activity |
---|---|
2022-03-04 | A misconfiguration on a user account on the server led to a large number of open files. The file limit was increased via configuration management tools. |
2022-03-08 11:00:00 | Our team investigated several aspects of our cron-job settings, including number of connections, total packets, ulimit settings, max connections, and Celery worker max memory and time limits, and reviewed Celery memory usage in New Relic. We freed up some memory by killing old Django shell instances. |
2022-03-08 13:00:00 | DevOps verified that there were no servers decommissioned or added from our last deployment schedule (3/4/22), except for the addition and then removal of the CloudWatch logging info in one of the cron-jobs. |
2022-03-08 14:00:00 | The team began investigating jump errors that were found beginning 2/28/22. This was later determined to be irrelevant to the issue. |
2022-03-08 15:00:00 | Debugged with added New Relic traces but did not find any relevant indicators of the slowdown. |
2022-03-08 17:00:00 | Identified one of the batch files for 837s was constrained. |
2022-03-09 07:45:00 | While looking at the load of each process running on the server, we found the batch_get_medical_reports process for 835s was running extremely slowly. We killed the process and it restarted on its next scheduled run. |
2022-03-09 07:48:00 | Reviewed file sizes for Emdeon per RCM team’s callout that Emdeon processing slowed down recently. We pulled out a file from Mar 4 that was unusually large and was holding up other files from processing, then restarted the process for Emdeon. |
2022-03-09 09:55:00 | Checked current memory usage and identified Django shell processes to clean up. |
2022-03-09 10:18:00 | Reviewed the values we've been using to run the billing cron job and found it was set to the lowest priority. DevOps changed it to the highest priority. |
2022-03-09 13:00:00 | Investigated how much time Sentry read timeout errors are taking and how much CPU they’re consuming. Reviewed Sentry configuration. |
2022-03-11 15:00:00 | DevOps team were able to get the new AWS Practice cron server set up. Practice team tested the daily_update_patient_last_appt_date cron on the new server. |
2022-03-15 08:15:00 | Status page update was added with file processing progress. |
2022-03-15 09:58:00 | Status page updated to identified status. |
2022-03-15 10:00:00 | Disabled the Clam AntiVirus daemon on cron-02, which was at ~202GB CPU. After that, we saw an increase in CPU usage for the ERA cron and an influx of claim submissions (up by 3K claims in 2 hours, from 10 am to 12 pm PT). |
2022-03-15 14:45:00 | Practice cron server on the AWS instance has been created. |
2022-03-16 08:30:00 | DevOps, Payments & Practice team members met to sync on the ERA processing and ERA & EHR Cron Servers statuses. DevOps pointed out we are not using all the memory and CPU available on production-cron-02. The team agreed to perform a hot fix to add New Relic traces for the ERA/835, 277 and 837 processing because this parallelized process was working adequately through March 4, 2022 before we started to see a significant slowdown in processing. |
2022-03-16 15:34:00 | Status page update was added with file processing progress. |
2022-03-16 19:18:00 | Hotfix was deployed to production. |
2022-03-17 11:00:00 | Identified the reason why the cron job stopped reporting on 3/4. There was an outdated copy of the Chef recipe that enabled the New Relic agent. The change was made to address issues with hitting open file limits on the cron-02 server. |
2022-03-17 13:10:00 | Status page update was added with file processing progress. |
2022-03-17 14:25:00 | DevOps identified a connection issue with Redis due to a missing firewall rule. The rule was added and communication started working. We started to see the production-cron-02 server resources being used more. |
2022-03-17 | Stale configuration management changes were identified as the primary cause of the change in communication between the cron and Redis servers. |
2022-03-18 14:19:00 | Status Page updated to monitoring status with an update on file processing progress. |
2022-03-21 11:57:00 | Status Page updated to resolved status. |
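
Three of the findings in the timeline above (open file limits, cron job priority, and connectivity from the cron server to Redis) can be checked with a few lines of script. The following is a minimal sketch using only the Python standard library on a Unix host, not the tooling the teams actually used. The function names are hypothetical, and the Redis hostname in the usage note is a placeholder.

```python
import os
import resource
import socket


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def process_niceness(pid=0):
    """Return the niceness of a process (0 means the calling process).

    Niceness ranges from -20 (highest priority) to 19 (lowest priority).
    """
    return os.getpriority(os.PRIO_PROCESS, pid)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

For example, running `can_connect("redis.example.internal", 6379)` from the cron server (the hostname is a placeholder; 6379 is the default Redis port) would return `False` while a firewall rule is missing, which is the kind of signal that took several days to surface during this incident.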