Incidents/20151014-MediaWiki
Summary
A MediaWiki config change that should have been a no-op was synced. It caused pages to render with no content (and to be cached in that state) and caused Special:Random to throw exceptions.
Timeline
- 17:56: Chad asks Legoktm if he has any objections to merging gerrit:232966 (written back in August). Legoktm quickly looks over it and says he has no objections.
- 17:59: !log demon@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 17s)
- 18:01: First report in #wikimedia-operations that something is wrong. "A database query error has occurred. This may indicate a bug in the software." and "I can't see any page"
- 18:03: Legoktm checks fatalmonitor and sees a bunch of 303 Compilation failed: two named subpatterns have the same name at offset 263 in /srv/mediawiki/php-1.27.0-wmf.2/includes/MagicWord.php on line 960
- 18:03: Luke081515 reports the problem in a Phabricator task (phab:T115505)
- 18:04-05: Confusion over whether twentyafterfour had started the train deploy (he hadn't)
- 18:06: Legoktm reverts config change in gerrit, while Chad deploys it: !log demon@tin Synchronized wmf-config: (no message) (duration: 00m 19s)
- 18:07: enwiki s1 slave lag triggers alert, paging ops (unrelated)
- 18:07: People are still reporting blank pages; purging the affected pages fixes them
- ... Discussion about what pages are cached, people trying to find a test case
- 18:11: Andre spots that https://en.wikipedia.org/wiki/GNU_General_Public_License is broken, Legoktm realizes that we need parser cache purges
- 18:14: Ori proposes using the RejectParserCacheValue hook to empty the cache and purge varnish
- 18:17: Legoktm determines that the outage was caused by double loading of the Disambiguator extension
- 19:11: Ori deploys hook to purge blank pages
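
The failure mode in the timeline (an extension loaded twice, followed by "two named subpatterns have the same name") can be illustrated with a minimal sketch. It is written in Python rather than MediaWiki's PHP, and it assumes the combined pattern uses one named group per magic word. The `build_pattern` helper, the `mw_` group prefix, and the word list are hypothetical and are not taken from MagicWord.php.

```python
import re

def build_pattern(magic_words):
    """Join magic words into one alternation, one named group per word."""
    # Hypothetical naming scheme: each word gets a group called "mw_<id>".
    parts = [f"(?P<mw_{ident}>{re.escape(text)})" for ident, text in magic_words]
    return "|".join(parts)

# Hypothetical registration list; loading an extension once adds one entry.
registered = [("notoc", "__NOTOC__"), ("disambig", "__DISAMBIG__")]

# With unique names the combined pattern compiles without complaint.
re.compile(build_pattern(registered))

# Loading the same extension a second time registers the same word again,
# so the combined pattern now contains two groups with the same name.
double_loaded = registered + [("disambig", "__DISAMBIG__")]

try:
    re.compile(build_pattern(double_loaded))
    compiled = True
except re.error as exc:
    compiled = False
    print(f"compilation failed: {exc}")
```

Python's `re` module raises on the duplicate name, so the problem surfaces at compile time; the incident was harder to spot because the equivalent failure was not turned into an exception.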
Conclusions
- It's probably safer to do these mass extension changes one at a time
- We should leave some time between config changes like this one and train deploys, so we can pinpoint which deploy caused an issue.
- MagicWord.php should have thrown exceptions instead of silently failing.
Actionables
Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.
- Status: Done. MagicWord.php should check the return value of preg_* calls instead of assuming they succeed (bug T115514)
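
The idea behind this actionable can be sketched as follows. This is a Python illustration, not the actual MediaWiki fix: `php_style_match` is a hypothetical stand-in for a call that reports failure by returning `False` instead of raising, and `checked_match` and `PatternError` are likewise names invented for this example.

```python
import re

class PatternError(RuntimeError):
    """Raised when the pattern itself is broken, as opposed to not matching."""

def php_style_match(pattern, text):
    """Hypothetical stand-in for a preg_*-style call.

    Returns 1 on a match, 0 on no match, and False if the pattern is invalid.
    """
    try:
        return 1 if re.search(pattern, text) else 0
    except re.error:
        return False

def checked_match(pattern, text):
    """Distinguish "no match" (0) from "the call failed" (False)."""
    result = php_style_match(pattern, text)
    # Identity check on purpose: 0 == False in Python, so "not result"
    # would silently treat a broken pattern as "no match".
    if result is False:
        raise PatternError(f"pattern failed to compile: {pattern!r}")
    return result == 1

print(checked_match(r"__NOTOC__", "text __NOTOC__ text"))  # True
print(checked_match(r"__NOTOC__", "plain text"))           # False

try:
    checked_match(r"(?P<a>x)|(?P<a>y)", "x")
except PatternError as exc:
    print(f"error surfaced: {exc}")
```

A caller that only tests the result for truthiness cannot tell a broken pattern from text that does not match, which is how a failure like this one can end up producing empty output instead of an error.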