Incidents/20151026-MediaWiki

From Wikitech
Revision as of 18:22, 26 October 2015 by EBernhardson (talk | contribs)

Summary

During the SF morning SWAT deploy I sent out patches that moved the Search EventLogging schema from the CirrusSearch repository to the WikimediaEvents repository. Upon deployment, ResourceLoader started emitting an error about duplicate module registration. This occurred because the patch that removes the schema from the CirrusSearch repository should have been deployed before the change that adds it to the WikimediaEvents repository. The problem was not noticed immediately because these errors are not included in `fatalmonitor` on fluorine, which was open in my shell to monitor deployment issues. After a couple of minutes the errors were noticed in https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor and the revert process started.

```
MWException from line 331 of /srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/ResourceLoader.php: ResourceLoader duplicate registration error. Another module has already been registered as schema.Search
```
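The ordering problem can be illustrated with a minimal sketch. This is a hypothetical Python model, not MediaWiki's PHP code: `register_all`, `DuplicateModuleError`, and the state dictionaries are made-up names, and the only assumption carried over from the incident is that registering the same module name twice is a fatal error.

```python
class DuplicateModuleError(Exception):
    """Hypothetical stand-in for the duplicate registration MWException."""


def register_all(extensions):
    """Register every module of every loaded extension; fail on duplicates."""
    registry = {}
    for ext_name, modules in extensions.items():
        for module in modules:
            if module in registry:
                raise DuplicateModuleError(
                    f"Another module has already been registered as {module}"
                )
            registry[module] = ext_name
    return registry


# Intermediate production state for the two possible deploy orders.
# Adding to WikimediaEvents first leaves both extensions declaring the schema.
add_first = {
    "CirrusSearch": ["schema.Search"],
    "WikimediaEvents": ["schema.Search"],
}
# Removing from CirrusSearch first leaves the schema briefly undeclared.
remove_first = {"CirrusSearch": [], "WikimediaEvents": []}
# Final state, identical for both orders.
final = {"CirrusSearch": [], "WikimediaEvents": ["schema.Search"]}
```

In this model `register_all(add_first)` raises, while `register_all(remove_first)` and `register_all(final)` both succeed. The remove-first order passes through a state where the module is missing rather than duplicated.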

Timeline

  • 15:13 ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Move search schema from cirrussearch -> wikimediavents (duration: 00m 19s)
  • 15:13 varnish starts reporting errors to logstash
  • 15:14 Reports in #wikimedia-operations of the site going down
  • 15:18 Revert the three patches to WikimediaEvents in gerrit
  • 15:21 ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents: rollback (duration: 00m 18s)
  • 15:22 All error graphs return to normal

Conclusions

The fatalmonitor on fluorine is the easiest monitor to keep on screen while working from a laptop with minimal screen space, but it does not contain all information about fatals on the site. Perhaps it could integrate 5xx reporting from graphite or logstash. Rolling back changes in gerrit takes much too long when there are multiple patch sets and time is of the essence; the rollback should be done on tin directly, and the deployment branches fixed up after the production issue has been resolved. Finally, ResourceLoader should not fatal the site due to configuration issues. The problem should be logged and the site should continue serving requests as best it can.
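As a rough illustration of the last point, a registry can log a duplicate registration and carry on instead of throwing. The sketch below is hypothetical Python, not ResourceLoader's actual code: `ModuleRegistry`, its methods, and the logger name are invented for the example, and keeping the first registration rather than the last is an arbitrary choice.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("resourceloader-sketch")


class ModuleRegistry:
    """Toy module registry that tolerates duplicate registrations."""

    def __init__(self):
        self._modules = {}

    def register(self, name, definition):
        """Register a module; on a duplicate name, log and keep the first one."""
        if name in self._modules:
            logger.error("Duplicate registration for module %s ignored", name)
            return False
        self._modules[name] = definition
        return True

    def get(self, name):
        """Return the stored definition, or None if the name is unknown."""
        return self._modules.get(name)
```

With this behaviour, a second `register("schema.Search", ...)` call returns `False` and leaves a log entry, and requests keep being served with the definition that was registered first.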

Actionables

  • Status: Unresolved. Include information about 5xx rate in fatalmonitor (bug T116627)
  • Status: Unresolved. ResourceLoader should not fatal the site due to configuration issues (bug T116628)