Connection Errors
Incident Report for Help Scout
Postmortem

What happened

Between 12:33pm and 1:19pm EDT on July 10, many Help Scout services were down or experiencing spotty performance. The web app, Docs, both public APIs, and the mobile app were all impacted. Officially, we tracked 18 minutes of downtime.

It's been years since we saw an incident that impacted so many services. The only component they all share is the event queues. Almost anything that happens in Help Scout goes through an event queue, and we had never before run into a queue problem that caused an outage like this.

Yesterday, an action put hundreds of thousands of events on the queues very quickly. Our Ops team was notified immediately and tried several things to restore connectivity, but each attempt held only temporarily. At 1:02pm EDT, the event queues were firewalled off for 12 minutes to give them time to return to a healthy state. Help Scout services were down during this time. At 1:14pm EDT, the firewall was removed and all services that feed the event queues were restarted. By 1:19pm EDT, service had been restored. All emails and other events were processed within 10 minutes.

No data was lost during the incident: every email and other event that was queued was eventually processed.

What we're doing about it

The root cause was code-related. Shortly after the incident, the engineering team deployed a temporary fix that will prevent the issue from recurring. In the next couple of days, a permanent fix will be deployed.

Other infrastructure changes will be made to further limit the impact if a similar problem occurs with the event queues.
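This report does not say what those changes are. As a purely illustrative sketch, one common way to limit the impact of a sudden burst of events is a token bucket on the producer side, which caps how fast a single action can push events onto a queue and defers the rest. Everything below is an assumption made for illustration, including the names `TokenBucket` and `enqueue_with_limit` and the use of plain lists as stand-ins for the queue; it is not Help Scout's actual fix.

```python
import time
from typing import Callable, Iterable, List


class TokenBucket:
    """Hypothetical limiter: `rate` events per second on average, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def enqueue_with_limit(events: Iterable[str], bucket: TokenBucket,
                       queue: List[str], deferred: List[str]) -> None:
    """Put events on `queue` while tokens last; park the rest in `deferred`.

    Both lists are stand-ins (assumptions) for a real event queue and a
    slower holding area that would be retried later.
    """
    for event in events:
        if bucket.try_acquire():
            queue.append(event)
        else:
            deferred.append(event)
```

With a limiter like this in front of the queue, a single action that generates hundreds of thousands of events would trickle in at a bounded rate instead of arriving all at once, at the cost of those events being processed later.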

Posted Jul 11, 2017 - 15:43 EDT

Resolved
As of 1:19pm EDT, connection failures were cleared up and all services have been restored. We'll publish a postmortem in the next 24 hours with a more complete description of what happened and what we're going to do to prevent it in the future.

Very sorry for the trouble today! We'll learn from the mistake and get better. If you have any questions or concerns, please reach out.
Posted Jul 10, 2017 - 13:39 EDT
Monitoring
Engineering has identified the issue and is monitoring things to ensure everything is resolved. Incoming and outgoing mail queues are slowly draining, and should be caught up in the next ten minutes.
Posted Jul 10, 2017 - 13:23 EDT
Update
We're experiencing rolling problems with both the Docs and Help Desk APIs. The engineering team is currently working to fix the root cause, but both Docs and the Help Desk will continue to experience intermittent connectivity until it's fully resolved.
Posted Jul 10, 2017 - 13:05 EDT
Update
The app is currently down, but our engineering team is investigating and working to get it back up ASAP. We'll update as soon as we have additional information.
Posted Jul 10, 2017 - 12:41 EDT
Investigating
Our automated systems have detected connection problems with the web app and the Ops team has been notified. Stay tuned for further information.
Posted Jul 10, 2017 - 12:37 EDT