Internal Server Error Impacting Some Customers
Incident Report for Help Scout
Postmortem

What Happened

Today at 4:45pm EDT, roughly one-sixth of Help Scout customers started seeing internal server errors on some pages. Two servers in the same shard, a production node and its redundant backup, stopped functioning. Our Ops team was not alerted to the issue because the instances were still up, even though they were returning errors.

Once the issue was discovered at 8:30pm EDT, the Ops team restarted the instances and fully restored service to the impacted customers by 8:55pm.

We don't believe any data is missing from the impacted accounts. As a precaution, we're reindexing all data from the last 24 hours. If you see anything abnormal in your account, please don't hesitate to reach out.

How We're Going to Resolve It Moving Forward

Two things need to be improved on our side to make sure this never happens again:

  1. We have to beef up our monitoring of this particular service and make sure we are alerted to issues immediately, even when the servers are still up.
  2. We're going to expand our automated testing to account for situations that only impact a small percentage of customers. Because this issue didn't affect our own account, it took us longer to notice the problem.
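
As a rough illustration of the first point, here is a minimal sketch of an alert that keys off the error rate per shard, not off whether the instances are up. It is not our actual monitoring code: the function name `shards_to_alert`, the `(shard_id, http_status)` sample format, and the 5% threshold are all assumptions made for the example.

```python
from collections import defaultdict


def shards_to_alert(samples, threshold=0.05, min_requests=20):
    """Return the shards whose 5xx error rate exceeds `threshold`.

    `samples` is an iterable of (shard_id, http_status) pairs, e.g. taken
    from recent access logs. The names and defaults here are hypothetical.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for shard, status in samples:
        totals[shard] += 1
        if 500 <= status <= 599:
            errors[shard] += 1

    # A shard can be "up" and still serve mostly errors, so we look at
    # responses rather than host liveness. Shards with too few requests
    # are skipped to avoid alerting on noise.
    return sorted(
        shard
        for shard, total in totals.items()
        if total >= min_requests and errors[shard] / total > threshold
    )
```

A check like this would flag a single unhealthy shard even when every other shard, including the one holding our own account, looks fine.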
Posted May 31, 2015 - 21:44 EDT

Resolved
This incident has been resolved.
Posted May 31, 2015 - 21:15 EDT
Monitoring
Issues are now resolved. We'll come up with a full explanation before closing this one out.
Posted May 31, 2015 - 20:57 EDT
Identified
We've identified the issue and expect to have it resolved in the next 10 minutes or less.
Posted May 31, 2015 - 20:54 EDT
Investigating
We've been experiencing an elevated level of internal server errors and are looking into the issue now.
Posted May 31, 2015 - 20:35 EDT