Starting around 11:45am EDT on Monday, we received alerts that our message queues were backing up with an abnormally large volume of queued outbound email. The high load caused intermittent performance issues, with the web app and APIs going up and down sporadically. From there, the trouble cascaded to the database. Starting at 11:58am EDT, all Help Scout services were unavailable.
Ultimately, the root cause of yesterday’s downtime was related to a deploy that went out earlier in the morning. An underlying problem with the changes pushed caused a large number of sanity test errors. Each error triggered an unexpected outgoing email, which led to the initial and rapid back-up of our message queues.
As the message queues continued to fill, they stopped responding to various backend services. Those services were holding database connections open while waiting for a response from the message queue. Because the queues were pinned at high CPU, those responses never came, and the stalled requests eventually exhausted the database connection pools. Unable to process events in the message queues or complete database transactions, all customer-facing services became unresponsive.
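To illustrate the failure mode in general terms (this is a minimal sketch with stand-in names, not our production code), the problem boils down to checking out a pooled database connection and then blocking indefinitely on a queue response. Once the queue stops replying, every request parks on a connection it never returns, and the pool drains. Bounding the wait lets the request fail while still giving the connection back:

```python
# Minimal sketch of the failure pattern. The broker and connection pool are
# stand-ins (queue.Queue and a semaphore); the point is that blocking on a
# queue reply with no timeout, while holding a pooled DB connection, exhausts
# the pool as soon as the queue stops responding.
import queue
import threading

POOL_SIZE = 5
db_pool = threading.BoundedSemaphore(POOL_SIZE)  # stand-in for a DB connection pool
broker_replies = queue.Queue()                   # stand-in for the queue's reply channel

def handle_request_unbounded():
    db_pool.acquire()                # check out a DB connection
    try:
        return broker_replies.get()  # blocks forever if the broker never answers,
    finally:                         # so the finally clause is never reached and
        db_pool.release()            # the connection is never returned

def handle_request_bounded():
    db_pool.acquire()
    try:
        return broker_replies.get(timeout=5)  # bounded wait on the queue
    except queue.Empty:
        return None                           # fail this request, but keep serving others
    finally:
        db_pool.release()                     # connection always goes back to the pool
```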
In total, the web app was unavailable for 55 minutes, while the help desk API and Docs API were down for 43 minutes. Mobile apps were down for 20 minutes. At 12:42pm EDT, all services were brought back online.
After a thorough post-incident analysis and review, we’ve identified a few areas where we’ll be focusing our energy over the next few weeks. Here’s what we’re working on: