Web App Connection Issues
Incident Report for Help Scout
Postmortem

Starting at 11:38 EST this morning, the Help Scout web application experienced sporadic connection failures, resulting in long page load times and, in some cases, pages failing to load at all. Because the issue came and went, the total came to roughly 16 minutes of downtime.

The root cause was related to our centralized logging infrastructure. Centralized logging is a critical system we use to debug issues and monitor the overall health of Help Scout. It's designed to never impact customer-facing systems, but in today's case a portion of it did.

This morning's issue was one we hadn't seen before. We could connect to the centralized logging cluster, but were unable to push data into it due to a server error. The Ops team rebuilt and re-synced the cluster to get things back on track.

As part of the recovery, we pushed a change so that this same type of error won't ever have a customer-facing impact again. We still have a bit more investigating to do on our end as to how the problem came about, but it won't impact your experience again.
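
For readers curious what keeping logging off the customer-facing path can look like in general, here is a minimal sketch in Python. It is an illustration only, not the change described above: the names `DropOnFullQueueHandler`, `build_logger`, and `max_pending` are hypothetical. The idea is that the request path only places log records on a bounded in-memory queue and never waits; if whatever ships logs to the cluster stalls and the queue fills up, new records are counted and dropped instead of slowing page loads.

```python
import logging
import logging.handlers
import queue


class DropOnFullQueueHandler(logging.handlers.QueueHandler):
    """Hypothetical handler: enqueue records without blocking, drop when full."""

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0  # number of records discarded because the queue was full

    def enqueue(self, record):
        try:
            # put_nowait never blocks, so a stalled log backend
            # cannot hold up the code that called logger.info().
            self.queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1


def build_logger(name, max_pending):
    """Return (logger, handler, queue) wired so that logging never blocks."""
    log_queue = queue.Queue(maxsize=max_pending)  # bounded buffer
    handler = DropOnFullQueueHandler(log_queue)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the sketch self-contained
    logger.addHandler(handler)
    return logger, handler, log_queue


if __name__ == "__main__":
    # Nothing drains the queue here, which simulates a backend that is stuck.
    logger, handler, log_queue = build_logger("webapp", max_pending=2)
    for i in range(5):
        logger.info("request %d served", i)
    print("queued:", log_queue.qsize(), "dropped:", handler.dropped)
```

In a setup like this, a separate worker (for example the standard library's `logging.handlers.QueueListener`) would drain the queue and ship records to the logging cluster. The trade-off is that some log lines are lost while the cluster is unhealthy, in exchange for page loads staying unaffected.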

We do everything we can to avoid these moments, and apologize for letting you down today. We learned a lot and will keep working to provide you with a reliable experience moving forward.

Posted Feb 01, 2016 - 17:13 EST

Resolved
We've got the all clear on this end and will follow up with a postmortem in the next 24 hours.
Posted Feb 01, 2016 - 12:38 EST
Update
We believe all page timeouts and performance issues are cleared up now, and we'll continue to monitor everything. Inbound and outbound queues were paused and we're clearing them out now. They should be back to normal within 10-15 minutes.
Posted Feb 01, 2016 - 12:21 EST
Update
We spoke too soon on the previous update. We're still investigating page timeouts on a percentage of inbound requests.
Posted Feb 01, 2016 - 12:13 EST
Monitoring
We've managed to clear up the page timeouts and performance should be back to normal for the moment. We're still monitoring all systems to make sure they remain stable.
Posted Feb 01, 2016 - 12:04 EST
Update
The Ops team is still working to narrow down a root cause of the connection issues. We'll update you as soon as we figure out what's going on.
Posted Feb 01, 2016 - 11:57 EST
Investigating
We're currently looking into reports of sluggish page loads and timeouts when accessing the web app.
Posted Feb 01, 2016 - 11:38 EST