Starting around 11:45am EDT on Monday, we received alerts that our message queues were backing up with an abnormally large volume of queued outbound email. The high load caused intermittent performance issues, with the web app and APIs going up and down sporadically. From there, the trouble cascaded to the database. Starting at 11:58am EDT, all Help Scout services were unavailable.
Ultimately, the root cause of yesterday’s downtime was related to a deploy that went out earlier in the morning. An underlying problem with the changes pushed caused a large number of sanity test errors. Each error triggered an unexpected outgoing email, which led to the initial and rapid back-up of our message queues.
As the message queues continued to fill, they stopped responding to various backend services. Those services were holding database connections open while waiting for a response from the message queue. Because the queues were pinned at high CPU, those responses never came, and the stalled requests eventually exhausted the database connection pools. Unable to process events in the message queues or complete database transactions, all customer-facing services became unresponsive.
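To illustrate the failure mode in general terms (this is a minimal sketch with stand-in names, not our production code), the problem boils down to checking out a pooled database connection and then blocking indefinitely on a queue response. Once the queue stops replying, every request parks on a connection it never returns, and the pool drains. Bounding the wait lets the request fail while still giving the connection back:

```python
# Minimal sketch of the failure pattern. The broker and connection pool are
# stand-ins (queue.Queue and a semaphore); the point is that blocking on a
# queue reply with no timeout, while holding a pooled DB connection, exhausts
# the pool as soon as the queue stops responding.
import queue
import threading

POOL_SIZE = 5
db_pool = threading.BoundedSemaphore(POOL_SIZE)  # stand-in for a DB connection pool
broker_replies = queue.Queue()                   # stand-in for the queue's reply channel

def handle_request_unbounded():
    db_pool.acquire()                # check out a DB connection
    try:
        return broker_replies.get()  # blocks forever if the broker never answers,
    finally:                         # so the finally clause is never reached and
        db_pool.release()            # the connection is never returned

def handle_request_bounded():
    db_pool.acquire()
    try:
        return broker_replies.get(timeout=5)  # bounded wait on the queue
    except queue.Empty:
        return None                           # fail this request, but keep serving others
    finally:
        db_pool.release()                     # connection always goes back to the pool
```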
In total, the web app was unavailable for 55 minutes, while the help desk API and Docs API were down for 43 minutes. Mobile apps were down for 20 minutes. At 12:42pm EDT, all services were brought back online.
After a thorough post-incident analysis and review, we’ve identified a few areas where we’ll be focusing our energy over the next few weeks. Here’s what we’re working on: