Google had a widespread networking outage that put our RabbitMQ instances into a bad state: the RabbitMQ hosts were queuing events, but many events were never processed. Blameless currently relies on Nameko for transport and messaging, and Nameko in turn relies on RabbitMQ for both. In short, if RabbitMQ is unavailable, services cannot communicate with each other and Blameless is down.
We looked at the logs and noticed that the Slack slash command request was making its way from postman to watchman but was never processed by Blamo (an internal service). This indicated the issue was either with Blamo or with RabbitMQ. The Blamo logs showed quite a few RabbitMQ connection errors earlier in the day, but the service appeared (based on the logs) to have recovered. We then inspected the RabbitMQ queue carrying the Slack slash command messages from watchman to Blamo and saw messages backing up in it. Next we restarted Blamo, on the assumption that the service had not properly registered and was therefore unable to consume from its queues; this did not resolve the issue. Finally, we restarted RabbitMQ to avoid prolonging the outage. The RabbitMQ restart worked, and the backlog of messages from watchman to Blamo was processed.
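The failure mode here can be sketched with a toy model. The snippet below is not Nameko or RabbitMQ code; it is a minimal stand-in broker (all names are illustrative) showing the symptom we observed: the broker keeps accepting messages into the queue, but while it is in a bad state nothing is delivered to the consumer, so messages back up until the broker is "restarted".

```python
from collections import deque

class ToyBroker:
    """Toy stand-in for RabbitMQ: accepts messages even when unhealthy,
    but only delivers them to consumers while healthy."""

    def __init__(self):
        self.queues = {}
        self.consumers = {}
        self.healthy = True

    def publish(self, queue, message):
        # Messages are always accepted into the queue (as RabbitMQ kept
        # queuing during the incident); delivery depends on broker health.
        self.queues.setdefault(queue, deque()).append(message)
        self.deliver(queue)

    def subscribe(self, queue, callback):
        self.consumers[queue] = callback
        self.deliver(queue)

    def deliver(self, queue):
        # In the degraded state, messages back up instead of being delivered.
        if not self.healthy:
            return
        q = self.queues.setdefault(queue, deque())
        callback = self.consumers.get(queue)
        while q and callback:
            callback(q.popleft())

broker = ToyBroker()
processed = []
broker.subscribe("watchman->blamo", processed.append)

broker.healthy = False                           # broker in a bad state
broker.publish("watchman->blamo", "slash-command-1")
print(len(broker.queues["watchman->blamo"]))     # 1: queued but not consumed

broker.healthy = True                            # "restart" restores delivery
broker.deliver("watchman->blamo")
print(processed)                                 # ['slash-command-1']
```

This also captures why restarting Blamo alone did not help: the consumer was registered all along, but the broker was not delivering.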
All RabbitMQ instances were bounced, restoring RabbitMQ to a stable state; once they were back up, full service was restored.