Outbound sending delays

Incident Report for Help Scout

Postmortem

Help Scout relies heavily on message queues for many in-app processes. Queues act as an intermediary behind the scenes, allowing different parts of Help Scout to communicate with each other at scale without impacting performance.

For example, when you send a reply, the reply is handed off to the message queue, then sent to the customer. Other features in Help Scout also rely on message queues, such as Workflows. When conditions are met and a workflow fires, those actions are queued and then performed on matching conversations.
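To illustrate the hand-off described above, here is a minimal sketch using Python's standard library. It is not Help Scout's actual implementation: the in-process `outbox` queue, the `send_reply` function, and the `delivered` list are hypothetical stand-ins for a real message broker and email delivery service.

```python
import queue
import threading

# Hypothetical in-process stand-in for a real message queue.
outbox = queue.Queue()
delivered = []  # placeholder for "emails actually sent"

def send_reply(conversation_id, body):
    """Request handler: enqueue the reply and return immediately."""
    outbox.put({"conversation_id": conversation_id, "body": body})

def worker():
    """Background worker: deliver queued replies one at a time."""
    while True:
        message = outbox.get()
        if message is None:  # sentinel value: stop the worker
            outbox.task_done()
            break
        delivered.append(message)  # a real worker would send the email here
        outbox.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

send_reply(101, "Thanks for reaching out!")
send_reply(102, "We're looking into it.")
outbox.put(None)   # ask the worker to stop once the backlog is drained
outbox.join()      # block until every queued item has been processed
thread.join()
```

Because the handler only enqueues, the person sending the reply never waits on delivery. The trade-off is that a backlog on the queue delays delivery without any visible error in the app.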

Yesterday, we faced two separate but semi-related problems with our message queues.

What happened

2:35 - 6:30pm EST
We started to notice connectivity issues around 2:35pm EST. These issues were caused by an abnormally high number of events being put on the queues. Customers experienced page timeouts and overall sluggishness throughout the app. This issue was corrected by 3:00pm EST, and we saw only about 5 minutes of downtime. However, we didn't yet understand where the problem started.

Our message queues responsible for sending outgoing email, updating folder counts and acting on workflows (among other things) were backed up with several hundred thousand events. It took a few hours to get the queues back to a normal level.

7:00pm - 4:30am EST
A little after 7pm EST, the same problem occurred. This time we ended up with several million events on the message queues.

We were able to identify and correct the problem rather quickly this time. The application did not experience any downtime or performance challenges in this window, but several things that depend on message queues were out of whack. The services responsible for updating folder views, folder counts, workflows, and search results fell drastically behind, as did outgoing email.

During this time, emails sent via Help Scout were not delivered to customers right away. Instead, they were delayed for 2-6 hours. The message queues caught up at 4:30am EST. No data was lost during this time period.

What we're doing about it

We've made a few short-term adjustments to the way we handle our message queues. We've also added logging and notifications for specific services. Long term, we're reprioritizing infrastructure updates to make sure these processes run smoothly at all times.

We're separating message queues for outgoing email, so that in the event of a backup, outgoing email is not delayed.
We're already working on a project that will keep folder counts up-to-date in a more real-time, efficient manner.
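The idea behind separating queues can be sketched as follows. This is an illustrative example only, under the assumption that events are routed by type. The `QueueRouter` class and the event names `outgoing_email` and `folder_count` are hypothetical and do not describe Help Scout's actual infrastructure.

```python
from collections import deque

class QueueRouter:
    """Route events to a dedicated queue when one exists, else to a shared default."""

    def __init__(self, dedicated):
        self.dedicated = set(dedicated)
        self.queues = {"default": deque()}
        for name in self.dedicated:
            self.queues[name] = deque()

    def publish(self, kind, payload):
        # Event types with a dedicated queue never wait behind other event types.
        name = kind if kind in self.dedicated else "default"
        self.queues[name].append((kind, payload))
        return name

    def depth(self, name):
        return len(self.queues[name])

router = QueueRouter(dedicated=["outgoing_email"])

# Simulate a burst of folder-count events backing up the shared queue.
for i in range(1000):
    router.publish("folder_count", {"folder_id": i})

# An outgoing email published during the backlog lands on its own queue.
router.publish("outgoing_email", {"conversation_id": 101})

email_backlog = router.depth("outgoing_email")
default_backlog = router.depth("default")
```

With a single shared queue, the email in this scenario would sit behind all 1,000 folder-count events. With a dedicated queue, it is first in line for the email workers regardless of the backlog elsewhere.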

We've learned valuable lessons from these events, and we're incredibly sorry for all of the stress and trouble we caused yesterday. We're off to make things better!

Posted Dec 19, 2014 - 17:53 EST

Resolved

We've resolved the root cause of the email backups. We'll have a complete postmortem with more information to share tomorrow. We're still processing a handful of delayed messages; however, this should wrap up within the hour. Folder views, folder counts, and workflows should also be up-to-date shortly. On behalf of the whole team, we are sincerely sorry for all of the trouble we've had over the last few hours. Thank you for getting in touch and hanging with us today. We've got some work to do!

Posted Dec 19, 2014 - 00:45 EST

Update

We're a little over the halfway mark, continuing to process delayed messages.

Posted Dec 18, 2014 - 22:31 EST

Update

Delayed messages are still being processed without incident. We'll continue to update until we wrap this up.

Posted Dec 18, 2014 - 21:37 EST

Monitoring

Queued messages are being processed as quickly as possible; however, we estimate that it may take 4 to 6 hours before we're in the clear. Replies sent to customers are all accounted for and will be delivered. Conversations will show in the appropriate folder as soon as the queue clears. As a reminder, folder views, folder counts, workflows, and search results will likely be delayed until this has been resolved.

Posted Dec 18, 2014 - 20:55 EST

Identified

We identified an issue internally and were able to make adjustments, bringing the Help Scout API back online. Users will continue to experience outbound email delays, inaccurate folder counts and displays, as well as workflows failing to fire. We're working to clear email backups now.

Posted Dec 18, 2014 - 20:17 EST

Update

We've temporarily disabled the Help Scout API as we continue to troubleshoot. API users will receive errors until connectivity is restored.

Posted Dec 18, 2014 - 19:50 EST

Update

We're still getting to the bottom of the queue backups. We'll continue to post regular updates as we have information to share.

Posted Dec 18, 2014 - 19:35 EST

Investigating

We're investigating a new issue with email delays that is unrelated to the incident reported earlier today. Messages sent via Help Scout may not be delivered right away, and folder views/counts may not update immediately after conversation actions.

Posted Dec 18, 2014 - 19:19 EST