Page MenuHomePhabricator

SiteStatsInit::edits and SiteStatsInit::articles query storm
Closed, ResolvedPublic

Description

S2 slaves were hit by a storm of these on svwiki:

SELECT /* SiteStatsInit::edits 127.0.0.1 */ COUNT(*) FROM revision LIMIT 1

Then minutes later a very similar storm of these:

SELECT /* SiteStatsInit::articles ... */ COUNT(DISTINCT page_id) FROM page, pagelinks WHERE page_namespace = '0' AND page_is_redirect = '0' AND (pl_from=page_id) LIMIT 1

The queries cam from wikiuser and various client IPs, implying they were not run from any maintenance script or job. Clients did include a lot of msnbot and googlebot, but not exclusively.

Both queries should be run asynchronously and cached somewhere (which based on IRC conversation between hoo/Nemo/aude, is actually the expected behavior).

Event Timeline

Springle raised the priority of this task from to Needs Triage.
Springle updated the task description. (Show Details)
Springle subscribed.
Aklapper triaged this task as High priority.Apr 8 2015, 1:53 PM

Change 202732 had a related patch set uploaded (by MaxSem):
Use PoolCounter for SiteStats updates

https://gerrit.wikimedia.org/r/202732

Change 202766 had a related patch set uploaded (by Aaron Schulz):
Prevent SiteStats::loadAndLazyInit from reloading in $wgMiserMode

https://gerrit.wikimedia.org/r/202766

Sample backtrace:

#0 /srv/mediawiki/php-1.25wmf23/includes/db/Database.php(1189): DatabaseBase->reportQueryError()
#1 /srv/mediawiki/php-1.25wmf23/includes/db/Database.php(1729): DatabaseBase->query()
#2 /srv/mediawiki/php-1.25wmf23/includes/db/Database.php(1397): DatabaseBase->select()
#3 /srv/mediawiki/php-1.25wmf23/includes/SiteStats.php(330): DatabaseBase->selectField()
#4 /srv/mediawiki/php-1.25wmf23/includes/SiteStats.php(378): SiteStatsInit->articles()
#5 /srv/mediawiki/php-1.25wmf23/includes/SiteStats.php(87): SiteStatsInit::doAllAndCommit()
#6 /srv/mediawiki/php-1.25wmf23/includes/SiteStats.php(54): SiteStats::loadAndLazyInit()
#7 /srv/mediawiki/php-1.25wmf23/includes/SiteStats.php(147): SiteStats::load()
#8 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/engines/LuaCommon/SiteLibrary.php(77): SiteStats::pages()
#9 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(450): Scribunto_LuaSiteLibrary->register()
#10 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(124): Scribunto_LuaEngine->instantiatePHPLibrary()
#11 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(197): Scribunto_LuaEngine->load()
#12 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(847): Scribunto_LuaEngine->getInterpreter()
#13 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(864): Scribunto_LuaModule->getInitChunk()
#14 /srv/mediawiki/php-1.25wmf23/extensions/Scribunto/common/Hooks.php(113): Scribunto_LuaModule->invoke()
#15 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3768): ScribuntoHooks::invokeHook()
#16 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3502): Parser->callParserFunction()
#17 /srv/mediawiki/php-1.25wmf23/includes/parser/Preprocessor_Hash.php(1094): Parser->braceSubstitution()
#18 /srv/mediawiki/php-1.25wmf23/extensions/ParserFunctions/ParserFunctions_body.php(106): PPFrame_Hash->expand()
#19 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3768): ExtParserFunctions::ifObj()
#20 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3502): Parser->callParserFunction()
#21 /srv/mediawiki/php-1.25wmf23/includes/parser/Preprocessor_Hash.php(1094): Parser->braceSubstitution()
#22 /srv/mediawiki/php-1.25wmf23/extensions/ParserFunctions/ParserFunctions_body.php(104): PPFrame_Hash->expand()
#23 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3768): ExtParserFunctions::ifObj()
#24 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3502): Parser->callParserFunction()
#25 /srv/mediawiki/php-1.25wmf23/includes/parser/Preprocessor_Hash.php(1094): Parser->braceSubstitution()
#26 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3644): PPFrame_Hash->expand()
#27 /srv/mediawiki/php-1.25wmf23/includes/parser/Preprocessor_Hash.php(1094): Parser->braceSubstitution()
#28 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(3319): PPFrame_Hash->expand()
#29 /srv/mediawiki/php-1.25wmf23/includes/parser/Parser.php(634): Parser->replaceVariables()
#30 /srv/mediawiki/php-1.25wmf23/includes/api/ApiExpandTemplates.php(110): Parser->preprocess()
#31 /srv/mediawiki/php-1.25wmf23/includes/api/ApiMain.php(1037): ApiExpandTemplates->execute()
#32 /srv/mediawiki/php-1.25wmf23/includes/api/ApiMain.php(394): ApiMain->executeAction()
#33 /srv/mediawiki/php-1.25wmf23/includes/api/ApiMain.php(367): ApiMain->executeActionWithErrorHandling()
#34 /srv/mediawiki/php-1.25wmf23/api.php(88): ApiMain->execute()
#35 /srv/mediawiki/w/api.php(3): include()
#36 {main}

Change 202766 merged by jenkins-bot:
Prevent SiteStats::loadAndLazyInit from reloading in $wgMiserMode

https://gerrit.wikimedia.org/r/202766

From IRC:

14:12 #wikimedia-operations: < MaxSem> springle, re that outage: is there a way to find out the contents of site_stats before that storm? i.e. was it empty or was there some bogus value?

S2 24h delayed replication on dbstore1001 is currently paused a few hours before the outage with binlogs (relay logs) up to an including the event. Do we still need to investigate this aspect, or does @aaron's stack trace and gerrit set above indicate we have the problem nailed?

From IRC:

14:12 #wikimedia-operations: < MaxSem> springle, re that outage: is there a way to find out the contents of site_stats before that storm? i.e. was it empty or was there some bogus value?

S2 24h delayed replication on dbstore1001 is currently paused a few hours before the outage with binlogs (relay logs) up to an including the event. Do we still need to investigate this aspect, or does @aaron's stack trace and gerrit set above indicate we have the problem nailed?

It might be worth seeing the last INSERT to the table to see if any thing particularly odd happened.

dbstore1001 stopped ~18 hours before the outage:

wmf dbstore1001 3306 root svwiki> select from_unixtime(1428431995);
+---------------------------+
| from_unixtime(1428431995) |
+---------------------------+
| 2015-04-07 18:39:55       |
+---------------------------+
1 row in set (0.25 sec)
wmf dbstore1001 3306 root svwiki> select * from site_stats;
+-----------+----------------+----------------+------------------+----------------+----------+-----------+-----------------+
| ss_row_id | ss_total_views | ss_total_edits | ss_good_articles | ss_total_pages | ss_users | ss_images | ss_active_users |
+-----------+----------------+----------------+------------------+----------------+----------+-----------+-----------------+
|         1 |        1052264 |       29691046 |          1961398 |        4573696 |   431103 |         0 |            3094 |
+-----------+----------------+----------------+------------------+----------------+----------+-----------+-----------------+

Lots of UPDATEs during the next 18h:

UPDATE /* SiteStatsUpdate::tryDBUpdateInternal ... */  `site_stats` SET ss_total_edits=ss_total_edits+1

The next INSERT was at 2015-04-08 12:36:57, after the outage began, when several dozen of these ran nigh simultaneously:

INSERT /* SiteStatsInit::refresh ... */ INTO `site_stats` (ss_row_id,ss_total_edits,ss_good_articles,ss
_total_pages,ss_users,ss_images) VALUES ('1','29682848','1961414','4573781','431159','0') ON DUPLICATE KEY UPDATE
 ss_row_id = '1',ss_total_edits = '29682848',ss_good_articles = '1961414',ss_total_pages = '4573781',ss_users = '
431159',ss_images = '0';

We can use the S2 master binlog to find the last insert before the outage, but whatever it was resulted in the above site_stats state (which looks right?).

I wonder if the pending stats delta key in memcached for one type (articles) fell out of cache or was re-hashed elsewhere but not another one (pages). The sanity check code might see them being mismatched and would trigger a rebuild. In any case, that will not happen now, and the crons will fix it eventually afaik.

the crons will fix it eventually afaik

So should this task be closed? Or is there anything left to sort out (with lower priority than currently)?

aaron claimed this task.

Change 202732 abandoned by MaxSem:
Use PoolCounter for SiteStats updates

Reason:
Superseded by other stuff.

https://gerrit.wikimedia.org/r/202732