Jump to content

Incidents/20150519-RESTBase

From Wikitech

Summary

The deployment of https://gerrit.wikimedia.org/r/#/c/198433/ brought down the RESTBase service for around 30 minutes, due to the parallel execution of Puppet agents across the RESTBase eqiad cluster. During this period, all of RESTBase's users were affected - VE, OCG, CXServer.

Timeline

Conclusions

This particular patch required the creation of new keyspaces in Cassandra, which does not handle parallel keyspace and table creation well. As all of the RESTBase services were restarted automatically by Puppet almost at the same time, they were fighting for this creation process, effectively rendering the cluster unavailable. Thus, when there is such a code/config change, gradual application of changes, rolling restarts and periodic checks of the previous steps are imperative.

Actionables