Database connection issues and application slowness (1h 15m downtime)

Incident Report for Help Scout

Resolved

9:14am EST - We're investigating connection issues with the website and will update you when we know more.

9:18am EST - Everything is back. We believe this and the last one were similar and have a good idea of what the problem is. Of course we'll work to resolve it as soon as possible. Sorry for the momentary trouble!

10:12am EST - We're noticing more connection issues and are not able to reach the database. We're investigating.

10:14am EST - The issue cleared up quickly. Our systems are under high load right now and we're keeping a very close eye on things.

10:56am EST - We've been dealing with some slow performance from high load for the last 90 minutes or so. Things seem to be leveling off at the moment, and we'll keep you posted on any further irregular activity.

12:45pm EST - Performance issues persist as a result of extraordinarily high load today. All hands are on deck to stabilize the service and we'll post any updates here.

1:15pm EST - Connections were mostly timing out from about 12:40pm EST and stabilized around 1:10pm EST. We continue to monitor the situation and are doing everything in our power to get performance back to normal.

1:30pm EST - It feels like we have today's issues solved, as the last 20 minutes have been quiet under normal load. Worldwide page load time is back down to 1.4 seconds, which is right where we usually are. Any other abnormal activity will be reported here as it happens.

1:50pm EST - We're down again, more info momentarily ...

2:10pm EST - Systems are back up but running slowly again. We're continuing to investigate the problem.

2:30pm EST - We're back to normal performance levels once again ... more info is on the way as to what happened.

2:34pm EST - When the system stabilized at 1:10pm EST, we got to that point by taking a number of web servers offline, upgrading them and restarting. In the chaos of getting things back up, we misconfigured one of the settings related to local file storage. The servers quickly ran out of disk space, causing another outage. We had to take them offline one more time, correct the setting and restart to get things back to normal.

It still feels like the original problem was resolved at 1:30pm, but we made a mistake in the heat of the moment while trying to get everything back. Now that the mistake is corrected, we're really hoping any performance issues are behind us for the day.

6:37pm EST - We're experiencing slowness and timeouts again and are investigating.

7:57pm EST - In order to resolve the connectivity issues experienced throughout the day, we've scheduled emergency maintenance at 10:00pm EST for 30 minutes. We're continuing to monitor all systems and will post any updates here in the meantime.

9:37pm EST - Page load times are still hovering around 6 seconds, well above average. The engineering team is still planning for a 30-minute maintenance window starting in 23 minutes. We're also working on an email to customers explaining the events of today. More as we get going with the emergency maintenance here in a bit ...

10:27pm EST - Emergency maintenance has been completed. We've still got most of the team online and watching everything closely. At the moment all systems are go and average page load time is down to 1.3 seconds.

12:00am EST on July 17 - Things have been quiet since the emergency maintenance but we're keeping a close eye on all system operations through the evening.

Summary - We estimate total downtime for today at 1 hour, 15 minutes. Elevated page load times and overall sluggish performance lasted about 6 hours cumulatively. This was a really painful day for all of us and on behalf of the whole team, we sincerely apologize for the issues today.

Posted Jul 16, 2014 - 12:00 EDT