Server Error Impacting Customers
Incident Report for Help Scout
Postmortem

What happened

Between 9:36 and 10:00am ET on July 9, customers experienced errors in Help Scout when loading conversations, search results and some other functions of the web app.

One of our search shards failed, which caused the errors and impacted roughly 1/6 (17%) of web app page views in that time. We have redundancies in place to prevent an error like this from impacting customers, but we also encountered a problem with the load balancer that manages traffic between the shards.

After a little over 20 minutes, our Ops team was able to resolve the issue and get everything back to normal.

What we did about it

Within 24 hours, we replaced the load balancer with new software and added end-to-end health checks to the impacted service. These items were on the short-term roadmap, but we dropped everything to knock them out following the issue.
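
For readers curious what an end-to-end health check can look like, here is a minimal sketch in Python. It is an illustration only, not our production code: the names `healthy_backends`, `route`, and the `probe` callable are hypothetical, and it assumes every backend in the pool can serve the same requests. The idea is that a backend stays in rotation only if a real request against it succeeds, rather than only if the process is still running.

```python
from typing import Callable, Iterable, List


def healthy_backends(backends: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the backends whose end-to-end probe succeeds.

    `probe` is a hypothetical callable that sends a real request
    (for example a small search query) and returns True on success.
    """
    alive = []
    for backend in backends:
        try:
            if probe(backend):
                alive.append(backend)
        except Exception:
            # A probe that raises is treated as a failed health check.
            continue
    return alive


def route(request_id: int, pool: List[str]) -> str:
    """Pick a backend from the healthy pool with a simple modulo spread."""
    if not pool:
        raise RuntimeError("no healthy backends available")
    return pool[request_id % len(pool)]
```

In this sketch, a backend that fails its probe is left out of the pool, so `route` never selects it until a later check passes again.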

Thanks to these changes, we feel very confident in our ability to detect this issue and prevent it from impacting customers in the future.

Posted Jul 15, 2015 - 09:27 EDT

Resolved
Sorry about the trouble today! All services are back to normal. We'll have more to share about what happened a bit later.
Posted Jul 09, 2015 - 13:08 EDT
Monitoring
Updating the status of this incident to Monitoring ...
Posted Jul 09, 2015 - 10:04 EDT
Update
We're back up and running! We'll keep an eye on things and follow up with an explanation shortly.
Posted Jul 09, 2015 - 10:01 EDT
Investigating
We started experiencing some internal server errors at 9:36am ET and are investigating the problems now. Please stay tuned for updates.
Posted Jul 09, 2015 - 09:54 EDT