Incident Overview
On July 16th, 2025, the DrChrono application experienced a temporary service outage following a scheduled release the evening prior. The release itself was successful; however, during post-release monitoring the morning of July 17th, we observed slightly elevated memory pressure across application servers. This memory pressure was not affecting the end-user experience; it was identified through the increased observability in place following the deployment. In response, a proactive configuration change was made to reduce memory usage. Unfortunately, this adjustment unintentionally restricted the system’s ability to allocate sufficient resources for application processes, resulting in a temporary outage. Because this occurred during core business hours, it took some time to restore enough resources to support application traffic, and services returned to normal operation once those resources were restored.
How We Responded
The configuration change was reverted, and traffic was temporarily paused to allow the system to recover. Once the application was confirmed healthy, traffic was resumed and the system became fully available.
Corrective and Preventative Actions
To prevent recurrence, we are taking the following steps:
Standard Operating Procedure (SOP) Enhancements: We are updating and reinforcing our internal SOPs to require a gradual rollout with verification testing for infrastructure setting changes before they are applied to all traffic, even when a change is thought to be safe or simple.
Warm Resources on Standby: We have created and will continue to maintain a separate pool of warm servers so that we can restore previous configurations more quickly and bring additional resources online faster during periods of high traffic.
We know that many of you rely on DrChrono every day to support your operations. We sincerely apologize for this disruption and are committed to strengthening our systems to prevent it from happening again. Thank you for your patience and continued trust.