Between 12:33 and 1:19pm EDT on July 10, many Help Scout services were down or performing poorly. The web app, Docs, both public APIs, and the mobile app were all affected. Officially, we tracked 18 minutes of downtime.
It's been years since an incident impacted this many services. The only component they all share is the event queues: almost everything that happens in Help Scout passes through an event queue, and we had never hit a failure there that cascaded like this before.
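For context, an event-queue architecture looks roughly like the sketch below: producers turn each action into a message and push it onto a queue, and workers consume those messages asynchronously. This is a minimal, generic illustration, not Help Scout's actual code; the names (`publish_event`, `EVENT_QUEUE`) are hypothetical, and the in-process queue stands in for a real distributed message broker.

```python
import json
import queue

# Hypothetical stand-in for a real message broker.
EVENT_QUEUE = queue.Queue()

def publish_event(event_type: str, payload: dict) -> None:
    """Nearly every action (new email, note, tag change, ...)
    becomes an event pushed onto a queue like this one."""
    EVENT_QUEUE.put(json.dumps({"type": event_type, "payload": payload}))

def consume_events() -> None:
    """Downstream workers pull events off the queue and fan them
    out to the services that need them."""
    while not EVENT_QUEUE.empty():
        event = json.loads(EVENT_QUEUE.get())
        print(f"processing {event['type']}")

publish_event("conversation.created", {"id": 123})
consume_events()
```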
Yesterday, a single action placed hundreds of thousands of events on the queues in a very short time. Our Ops team was alerted immediately and tried several measures to restore connectivity, but each recovery held only temporarily. At 1:02pm EDT, we firewalled off the event queues for 12 minutes to give them time to return to a healthy state; Help Scout services were down during this window. At 1:14pm EDT, the firewall was removed and every service that feeds the event queues was restarted. By 1:19pm EDT, service had been restored.
No data was lost during the incident. All emails and other events were processed within 10 minutes of recovery.
The root cause was in our code. Shortly after the incident, the engineering team deployed a temporary fix that prevents the issue from recurring, and a permanent fix will be deployed in the next couple of days.
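We won't detail the fix itself here, but a common safeguard against this failure mode is to cap how quickly any single producer can enqueue events. The sketch below is a generic token-bucket rate limiter, offered purely as an illustration of that idea; it is not our actual patch, and every name in it is hypothetical.

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter: an illustration of capping
    burst event publication, not Help Scout's actual fix."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: allow at most 1,000 events/sec, with bursts up to 5,000.
limiter = TokenBucket(rate_per_sec=1000, burst=5000)

def publish_event(event) -> None:
    if not limiter.allow():
        raise RuntimeError("event rate limit exceeded; deferring publication")
    # ... hand the event to the queue here ...
```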
We will also make other infrastructure changes to further limit the impact should a similar problem ever occur with the event queues.
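As one illustration of the kind of change in this category (a sketch, not a commitment to a specific design), a circuit breaker between event producers and the queues lets an unhealthy queue fail fast: after repeated failures, producers stop calling the queue for a cool-down period and can buffer or shed work locally, instead of every service stalling behind it. All names here are hypothetical.

```python
import time

class CircuitBreaker:
    """Hypothetical circuit breaker: after too many failures, skip the
    queue for a cool-down period instead of letting every call hang."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping queue")
            self.failures = 0  # cool-down elapsed; try again (half-open)
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise

breaker = CircuitBreaker()
# Usage with a hypothetical queue client:
# breaker.call(queue_client.publish, event)
```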