Conversation pages timing out
Incident Report for Help Scout
Postmortem

This morning between 8:10am and 8:35am EST, a small number of Help Scout customers experienced connection failures that looked like this.

For the last few weeks, our Operations team has been slowly rolling out new infrastructure that supports all functionality built on search. About 75% of our customers were running on the new infrastructure this morning when a production server and its redundant backup stopped reporting. They seemed to be running just fine, but our hosting provider said they were down. This caused roughly 1/6 of page loads hitting the new search infrastructure to see the error. A restart of both servers resolved the problem.

What we're doing about it

It's going to take a bit more research over the weekend to figure out exactly what happened and how we'll fix it. The redundancy plan didn't work out in this case and we're going to re-evaluate it.

Although the engineering team responded quickly to the problem, we did come up with some ways to improve alerts and monitoring of the new infrastructure.

We continue to monitor everything, and service has been back to normal since the brief issue this morning. Our apologies for the intermittent trouble.

Posted Mar 06, 2015 - 12:51 EST

Resolved
Today's issue has been resolved. It impacted roughly 17% of our customer base, which was using a new bit of infrastructure we've been gradually rolling out. We'll follow up with a complete postmortem today once more information is available.
Posted Mar 06, 2015 - 08:43 EST
Monitoring
Problematic servers are being rebooted now. If you see an error page, please press refresh and the page should load properly. On this end we're seeing everything back to normal, but are continuing to monitor errors.
Posted Mar 06, 2015 - 08:35 EST
Identified
Still seeing issues related to a new search cluster that only *some* of our customer base is using. We're working hard on resolving it.
Posted Mar 06, 2015 - 08:25 EST
Monitoring
Issue seems to have resolved itself, but we're still investigating what happened.
Posted Mar 06, 2015 - 08:18 EST
Investigating
We're currently investigating connection issues with conversation pages and potentially some other pages.
Posted Mar 06, 2015 - 08:13 EST