Elevated Connection Errors

Incident Report for Help Scout

Postmortem

This morning at 4:30am EST, the Help Scout public API was hit with about 8,000 search queries in a very short time span. While we do have rate limiting on the API, it is by the hour and not by the minute.

The requests caused timeouts as our search infrastructure (which handles a lot more than just search) was under very heavy load. Engineers on our team were able to get the problem under control by 5:50am EST.

Between 8:54am and 9:02am EST, there was a second problem. We spent the morning upgrading our search infrastructure to add capacity. When we put the new servers in, the load balancer did not recognize them for several minutes, which caused more errors for a short time.

How we're going to improve

The most obvious improvement we can make is to implement more stringent rate limits that cap API requests by the minute as well as by the hour. We're also going to build out much finer-grained controls that can be managed on a per-company basis.
In this case, the problems did not bring Help Scout offline, but they rendered several pages (Workflows folders, conversations, search results) unusable. The application should degrade more gracefully when a problem like this occurs, so that unaffected functionality keeps working properly.
We have already been revamping our on-call policies for engineers in case something goes wrong outside of normal hours, as it did today. With the proper authority given to engineers in Europe at that hour, we possibly could have solved this issue much faster.
The second issue today involving upgraded hardware and the load balancer can be fixed by adjusting our internal processes. Amazon's ELBs are a bit quirky and we can avoid that problem moving forward now that we've learned from this mistake.
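
To illustrate the first improvement above, here is a minimal sketch of a limiter that enforces a per-minute cap and a per-hour cap at the same time, keyed per company. It is not our production code: the class name, the limits, and the in-memory storage are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque


class DualWindowRateLimiter:
    """Hypothetical sliding-window limiter with a per-minute and a per-hour cap."""

    def __init__(self, per_minute, per_hour, clock=time.monotonic):
        # (window length in seconds, max accepted requests in that window)
        self.limits = ((60.0, per_minute), (3600.0, per_hour))
        self.clock = clock
        # key (e.g. a company ID) -> timestamps of accepted requests
        self.hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        timestamps = self.hits[key]
        # Forget requests older than the longest window (one hour).
        while timestamps and now - timestamps[0] >= 3600.0:
            timestamps.popleft()
        # Reject if either window is already full.
        # Counting is linear in the number of stored timestamps, which is
        # acceptable for a sketch but not tuned for high volume.
        for window, limit in self.limits:
            recent = sum(1 for t in timestamps if now - t < window)
            if recent >= limit:
                return False
        timestamps.append(now)
        return True
```

With a limiter like this, a burst of thousands of requests inside a few minutes would be rejected once the per-minute cap is reached, even if the hourly cap still has room. Per-company controls could be layered on by giving each company its own limits.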

In closing

Help Scout hasn't lived up to what you expect in recent weeks and there's no one more aware of it than us. We can and will make good on the improvements we've talked about today. Our most important priority is creating a more reliable experience for you. We sincerely apologize for the troubles today and look forward to getting better.

Posted Jan 06, 2015 - 15:34 EST

Resolved

What we know so far is that from 4:30am to 5:50am EST, some pages in Help Scout were generating an "Operation Timed Out" error. We had on-call engineers at the time, but no one on our support team was available at that hour, so our communication was not as good as it should have been. We're going to mark the issue resolved for now and start work on a postmortem that will explain everything in greater detail.

Posted Jan 06, 2015 - 08:00 EST

Monitoring

Everything is back to normal now. We had engineers working on the problem, but support is just now catching up. It looks like the problem lasted about 90 minutes, and we're still looking into it.

Posted Jan 06, 2015 - 07:25 EST

Investigating

We're experiencing an elevated level of connection errors and are looking into the issue.

Posted Jan 06, 2015 - 07:19 EST