It's disappointing that we let you down twice in one week, folks. We expect to provide you with a much more reliable experience than we have recently. Here's a brief overview of what happened and what we'll do about it moving forward.
At about 7:20am EST this morning, a high volume of database connections made the web app unavailable: page loads timed out or failed entirely. The outages recurred intermittently until about 1:30pm EST, totaling roughly 35 minutes of downtime, about 9% of the status window. Outside those periods, performance and in-app functionality were largely normal.
In addition, the public help desk API was down for a total of 4 hours and 26 minutes. Most of that downtime was a proactive measure: the Ops team took the API offline to help narrow down the problem.
Lastly, this incident was unrelated to the one experienced earlier this week. In both cases, no data was lost; some was temporarily delayed by a few minutes at various times during the status window.
Today's incident was challenging to track down because it triggered a chain reaction of service failures that left the web app intermittently unresponsive.
The root of the problem turned out to be our messaging queues: a very large number of events related to search indexing, running behind the scenes, had accumulated in them. The indexing had been kicked off in preparation for upcoming search infrastructure upgrades.
While the message queues themselves never failed, the volume of events caused the initial problem and set off a chain reaction that ended with the database receiving a surge of connections.
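The post doesn't describe the exact mechanism, but the general failure mode is common: when each queued event grabs its own database connection, a large backlog translates directly into a connection surge. A minimal Python sketch of one standard safeguard, a semaphore that caps concurrent connections, is below. All names and numbers here are illustrative, not taken from the actual system.

```python
# Illustrative sketch: a backlog of 200 queued events arrives at once.
# Without a cap, each worker would open its own database connection;
# a semaphore bounds how many can hold a connection simultaneously.
import threading
import time

MAX_DB_CONNECTIONS = 5           # hypothetical cap; real pools are configurable
db_semaphore = threading.Semaphore(MAX_DB_CONNECTIONS)

active = 0                       # connections currently "open"
peak = 0                         # highest concurrency observed
lock = threading.Lock()

def handle_event(event_id):
    """Process one queued indexing event, holding at most one DB connection."""
    global active, peak
    with db_semaphore:           # blocks once MAX_DB_CONNECTIONS are in use
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.001)        # simulated query
        with lock:
            active -= 1

threads = [threading.Thread(target=handle_event, args=(i,)) for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)  # never exceeds MAX_DB_CONNECTIONS
```

With the cap in place, a flood of events queues up behind the semaphore instead of fanning out into the database, trading a slower drain of the backlog for a database that stays responsive.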
To resolve the problem, we cleared out the queue in question and stopped all indexing activity. This action did not and will not impact any customer accounts.
As a result of today's incident, we added several items to our wish list, each with a different priority. Of course, any work that would prevent this sort of failure, or better surface the nature of the issue in the future, will be top priority. The improvements include changes to caching, rate-limiting responses, message queues, and interaction with third-party services.
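Of the improvements listed above, rate limiting is the most self-contained to illustrate. A minimal sketch of a token-bucket limiter follows; the class, its parameters, and the numbers are all hypothetical, shown only to convey the idea of absorbing a burst and then throttling to a steady rate.

```python
# Illustrative token-bucket rate limiter: allow sustained traffic at
# `rate` requests/second, with bursts of up to `capacity` requests.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic()
        # Refill tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(20)]
allowed = results.count(True)
print(allowed)  # roughly the burst capacity: later calls are rejected
```

A limiter like this, placed in front of expensive endpoints or queue consumers, sheds excess load early rather than letting it cascade into downstream services.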
Most importantly, we want to apologize for what happened today. It's no way to end the work week. We look forward to doing much better and getting back to a very reliable experience for you and your teammates.