It's disappointing that we let you down twice in one week, folks. We expect to provide you with a much more reliable experience than we have recently. Here's a brief overview of what happened and what we'll do about it moving forward.
At about 7:20am EST this morning, a high volume of database connections made the web app unavailable: page loads timed out or failed entirely. The outages recurred intermittently until about 1:30pm EST, totaling roughly 35 minutes of downtime, about 9% of the status window. Outside those periods, performance and in-app functionality were largely normal.
In addition, the public help desk API was down for a total of 4 hours and 26 minutes. Most of that downtime was a proactive measure: the Ops team took the API offline to help narrow down the problem.
Lastly, this incident was unrelated to the one experienced earlier this week. In both cases, no data was lost; some was temporarily delayed by a few minutes at various times during the status window.
Today's incident was challenging to track down because it triggered a chain reaction of service failures that left the web app intermittently unresponsive.
The root of the problem turned out to be our messaging queues: a very large number of events related to search indexing, running behind the scenes, had accumulated in them. The indexing had been kicked off in preparation for upcoming search infrastructure upgrades.
While the message queues themselves never failed, the volume of events caused the initial problem and set off a chain reaction that ended with the database receiving a surge of connections.
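The post doesn't describe the exact mechanism, but the general failure mode is common: when each queued event grabs its own database connection, a large backlog translates directly into a connection surge. A minimal Python sketch of one standard safeguard, a semaphore that caps concurrent connections, is below. All names and numbers here are illustrative, not taken from the actual system.

```python
# Illustrative sketch: a backlog of 200 queued events arrives at once.
# Without a cap, each worker would open its own database connection;
# a semaphore bounds how many can hold a connection simultaneously.
import threading
import time

MAX_DB_CONNECTIONS = 5           # hypothetical cap; real pools are configurable
db_semaphore = threading.Semaphore(MAX_DB_CONNECTIONS)

active = 0                       # connections currently "open"
peak = 0                         # highest concurrency observed
lock = threading.Lock()

def handle_event(event_id):
    """Process one queued indexing event, holding at most one DB connection."""
    global active, peak
    with db_semaphore:           # blocks once MAX_DB_CONNECTIONS are in use
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.001)        # simulated query
        with lock:
            active -= 1

threads = [threading.Thread(target=handle_event, args=(i,)) for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)  # never exceeds MAX_DB_CONNECTIONS
```

With the cap in place, a flood of events queues up behind the semaphore instead of fanning out into the database, trading a slower drain of the backlog for a database that stays responsive.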
To resolve the problem, we cleared out the queue in question and stopped all indexing activity. This action did not and will not impact any customer accounts.
As a result of today's incident, we added several items to our wish list, each with a different priority. Of course, any work that would prevent this sort of failure, or better surface the nature of the issue in the future, will be top priority. The improvements include changes to caching, rate-limiting responses, message queues, and interaction with third-party services.
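Of the improvements listed above, rate limiting is the most self-contained to illustrate. A minimal sketch of a token-bucket limiter follows; the class, its parameters, and the numbers are all hypothetical, shown only to convey the idea of absorbing a burst and then throttling to a steady rate.

```python
# Illustrative token-bucket rate limiter: allow sustained traffic at
# `rate` requests/second, with bursts of up to `capacity` requests.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic()
        # Refill tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(20)]
allowed = results.count(True)
print(allowed)  # roughly the burst capacity: later calls are rejected
```

A limiter like this, placed in front of expensive endpoints or queue consumers, sheds excess load early rather than letting it cascade into downstream services.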
Most importantly, we want to apologize for what happened today. It's no way to end the work week. We look forward to doing much better and getting back to a very reliable experience for you and your teammates.