Help Scout relies heavily on message queues for many in-app processes. Queues act as an intermediary behind the scenes, allowing different parts of Help Scout to communicate with each other at scale without impacting performance.
For example, when you send a reply, the reply is handed off to the message queue, then sent to the customer. Other features in Help Scout also rely on message queues, such as Workflows. When conditions are met and a workflow fires, those actions are queued and then performed on matching conversations.
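To make the hand-off concrete, here is a minimal sketch of the pattern described above. It is not Help Scout's actual code: it uses Python's in-process `queue.Queue` as a stand-in for a real message broker, and the names `outbox`, `send_reply`, and `worker` are hypothetical. The request path only enqueues the reply, and a background worker delivers it later.

```python
import queue
import threading

# Hypothetical in-process stand-in for a message queue; a real deployment
# would use a separate broker service instead.
outbox = queue.Queue()
delivered = []

def send_reply(conversation_id, body):
    """Request path: enqueue the reply and return immediately."""
    outbox.put({"conversation_id": conversation_id, "body": body})

def worker():
    """Background consumer: pull events off the queue and deliver them."""
    while True:
        event = outbox.get()
        if event is None:  # sentinel value tells the worker to stop
            outbox.task_done()
            break
        delivered.append(event)  # stand-in for actually sending the email
        outbox.task_done()

def run_demo():
    """Enqueue two replies, let the worker drain the queue, return results."""
    consumer = threading.Thread(target=worker)
    consumer.start()
    send_reply(1, "Thanks for writing in!")
    send_reply(2, "We're looking into it.")
    outbox.put(None)
    consumer.join()
    return delivered
```

Because the sender never waits on delivery, the app stays responsive; the trade-off is that a backed-up queue delays everything waiting in it.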
Yesterday, we faced two separate but semi-related problems with our message queues.
2:35 - 6:30pm EST
We started to notice connectivity issues around 2:35pm EST. These issues were caused by an abnormally high number of events being put on the queues. Customers experienced page timeouts and overall sluggishness throughout the app. This issue was corrected by 3:00pm EST, and we only saw about 5 minutes of downtime. However, we didn't quite understand where the problem started.
Our message queues responsible for sending outgoing email, updating folder counts and acting on workflows (among other things) were backed up with several hundred thousand events. It took a few hours to get the queues back to a normal level.
7:00pm - 4:30am EST
A little after 7pm EST, the same problem occurred. This time we ended up with several million events on the message queues.
We were able to identify and correct the problem rather quickly this time. The application did not experience any downtime or performance challenges in this window, but several things that depend on message queues were out of whack. The services responsible for updating folder views, folder counts, workflows, and search results fell drastically behind, as did outgoing email.
During this time, emails sent via Help Scout were not delivered to customers right away. Instead, they were delayed for 2-6 hours. The message queues caught up at 4:30am EST. No data was lost during this time period.
We've made a few short-term adjustments to the way we handle our message queues. We've also added more logging and notifications for specific services. Long term, we're reprioritizing infrastructure updates to make sure these processes run smoothly at all times.
We're separating message queues for outgoing email, so that in the event of a backup, outgoing email is not delayed.
We're already working on a project that will keep folder counts up-to-date in a more real-time, efficient manner.
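As a rough illustration of why a separate queue for outgoing email helps, the sketch below counts how many events a new email has to wait behind in a first-in, first-out queue. The queue names (`default`, `email`) and the function `events_ahead_of_email` are hypothetical and only model the idea, not our actual infrastructure.

```python
from collections import deque

def events_ahead_of_email(backlog_size, dedicated_email_queue):
    """Return how many events sit ahead of a newly queued outgoing email."""
    queues = {"default": deque()}
    if dedicated_email_queue:
        queues["email"] = deque()

    # Simulate a flood of non-email events, e.g. folder count updates.
    for i in range(backlog_size):
        queues["default"].append(("folder_count", i))

    # Route the email to its own queue if one exists, else the shared one.
    target = queues["email"] if dedicated_email_queue else queues["default"]
    ahead = len(target)
    target.append(("outgoing_email", "reply"))
    return ahead
```

With a shared queue, `events_ahead_of_email(1000, False)` returns 1000: the email waits behind the whole backlog. With a dedicated queue, `events_ahead_of_email(1000, True)` returns 0, so the same backlog no longer delays delivery.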
We've learned valuable lessons from these events, and we're incredibly sorry for all of the stress and trouble we caused yesterday. We're off to make things better!