Loss of blamelessbot and Blameless GUI response

Incident Report for Blameless

Postmortem

Google had a widespread networking outage that put our RabbitMQ instances into a bad state. The RabbitMQ hosts were queuing events, but many events were not being processed. Blameless currently relies on Nameko for transport and messaging, which relies on RabbitMQ for both. In short, if RabbitMQ is unavailable, services cannot communicate with each other and Blameless is unavailable.

We looked at the logs and noticed that the Slack slash command request was making its way from postman to watchman but was never processed by Blamo (an internal service). This indicated the issue was with either Blamo or RabbitMQ. The Blamo logs showed quite a few RabbitMQ connection errors earlier in the day, but based on the logs the service seemed to have recovered. We looked at the RabbitMQ queue for the messages sent from watchman to Blamo for the Slack slash command, and it showed messages backing up in the queue. Next, we restarted Blamo under the assumption that the service had not properly registered and thus was unable to consume from the queues. This did not resolve the issue. Finally, we decided to restart RabbitMQ to avoid prolonging the issue further. The RabbitMQ restart worked, and the messages sent from watchman to Blamo were processed.
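
To illustrate the kind of queue check described above, here is a minimal sketch in Python. It is not the tooling used during the incident. It assumes queue statistics are available as dictionaries with `name`, `messages_ready`, and `consumers` fields, similar to what the RabbitMQ management API reports per queue. The function name, the backlog threshold, and the queue names in the sample data are hypothetical.

```python
from typing import Dict, List


def find_stalled_queues(queues: List[Dict], backlog_threshold: int = 100) -> List[str]:
    """Return the names of queues that look stalled.

    A queue is flagged when either:
      - it holds ready messages but has no consumers attached, or
      - its ready-message backlog has reached backlog_threshold, even
        though consumers appear to be attached (the situation described
        in this postmortem, where restarting the consumer did not help).
    """
    stalled = []
    for queue in queues:
        ready = queue.get("messages_ready", 0)
        consumers = queue.get("consumers", 0)
        if ready > 0 and consumers == 0:
            stalled.append(queue["name"])
        elif ready >= backlog_threshold:
            stalled.append(queue["name"])
    return stalled


# Hypothetical sample data; the queue names are illustrative only.
sample = [
    {"name": "rpc-blamo", "messages_ready": 250, "consumers": 1},
    {"name": "rpc-watchman", "messages_ready": 0, "consumers": 2},
    {"name": "rpc-postman", "messages_ready": 3, "consumers": 0},
]

print(find_stalled_queues(sample))
```

A check like this, run on a schedule, could surface a growing backlog before customers report it. The threshold would need tuning to normal traffic levels.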

Resolution

All RabbitMQ instances were bounced, restoring RabbitMQ to a stable state. Once RabbitMQ was bounced, service was fully restored.

Posted Jun 04, 2019 - 19:02 UTC

Resolved

This incident has been resolved. Postmortem to follow.

Posted Jun 03, 2019 - 04:06 UTC

Update

We are restarting the messaging service across all clusters.

Posted Jun 03, 2019 - 02:50 UTC

Investigating

We are currently investigating a report by one of our customers that /blameless commands had stopped working. This was reported at 17:18 PT. We also have additional reports that GUI operations are timing out. We have seen this behavior within our own infrastructure. We are investigating how outages at some of our upstream providers might be having an effect. More info as we know it.

Posted Jun 03, 2019 - 02:40 UTC

This incident affected: Blameless.io.