Convert eqiad imagescalers to HHVM, Trusty
Closed, ResolvedPublic

Description

In our quest to convert the whole cluster to trusty and HHVM, we've got to the imagescalers.

From what I can see in puppet and on the current image scalers, we have the following custom-built packages at the moment:

  • ffmpeg2theora (disabled libav multithreading)
  • librsvg (external resources loading protection)
  • libvips31 (backport)
  • libvpx1 (backport)

We also need to evaluate the performance of HHVM under the peculiar imagescalers load, which is probably going to be non-optimal.

Finally, testing needs to be thorough enough to catch any rendering differences between trusty and precise.

Event Timeline

Change 196173 merged by Giuseppe Lavagetto:
mediawiki: install fonts metric-compatible with Calibri and Cambria

https://gerrit.wikimedia.org/r/196173

@Joe, do we need to push this to next week?

At minimum. Giuseppe ran into segfaults with Tim's output buffer patch and that's blocking the rollout of a new package, which the scalers need.

OK, let's schedule to a specific week once we have confidence on the ETA.

Ok so, after a lot of battles with HHVM 3.6.1, the mysterious 503s on the imagescalers continued.

One possible cause is the fact that apparently some images require more memory to be converted on trusty:

curl -H 'Host: commons.wikimedia.org' 'http://mw1152/w/thumb_handler.php/1/17/HawkridgeBarton_Chittlehampton_NorthDevon.PNG/1000px-HawkridgeBarton_Chittlehampton_NorthDevon.PNG' -I
HTTP/1.1 500 Internal server error
Date: Wed, 06 May 2015 15:13:17 GMT
Server: Apache
X-Powered-By: HHVM/3.6.1
X-Content-Type-Options: nosniff
Cache-control: no-cache
X-MW-Thumbnail-Renderer: mw1152
Connection: close
Content-Type: text/html; charset=utf-8

which results in the kernel recording

May  6 15:13:20 mw1152 kernel: [4775538.759268] memory: usage 307200kB, limit 307200kB, failcnt 50
May  6 15:13:20 mw1152 kernel: [4775538.759269] memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
May  6 15:13:20 mw1152 kernel: [4775538.759270] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
May  6 15:13:20 mw1152 kernel: [4775538.759271] Memory cgroup stats for /mediawiki/job/15592: cache:0KB rss:307200KB rss_huge:202752KB mapped_file:0KB writeback:0KB inactive_anon:0KB active_anon:307132KB inactive_file:0KB active_file:0KB unevictable:0KB
May  6 15:13:20 mw1152 kernel: [4775538.759282] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
May  6 15:13:20 mw1152 kernel: [4775538.759334] [15592]    33 15592     5222      426      16        0             0 bash
May  6 15:13:20 mw1152 kernel: [4775538.759336] [15594]    33 15594     2867      167      11        0             0 timeout
May  6 15:13:20 mw1152 kernel: [4775538.759337] [15595]    33 15595    93743    77414     186        0             0 convert
May  6 15:13:20 mw1152 kernel: [4775538.759339] Memory cgroup out of memory: Kill process 15595 (convert) score 1010 or sacrifice child
May  6 15:13:20 mw1152 kernel: [4775538.768545] Killed process 15595 (convert) total-vm:374972kB, anon-rss:307004kB, file-rss:2652kB
May  6 15:15:01 mw1152 CRON[15698]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
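
For anyone comparing several of these OOM events, the relevant numbers can be pulled out of the kernel log mechanically. The following is only a rough sketch: the function name and the regular expressions are my own, written against the two line shapes quoted above, and other kernel versions may format these lines differently.

```python
import re

# Patterns for the two kernel log line shapes quoted above (assumed stable).
USAGE_RE = re.compile(r"\] memory: usage (\d+)kB, limit (\d+)kB, failcnt (\d+)")
KILL_RE = re.compile(r"Killed process (\d+) \((\S+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB")


def summarize_memcg_oom(lines):
    """Collect cgroup usage/limit and the killed process from memcg OOM lines."""
    summary = {}
    for line in lines:
        m = USAGE_RE.search(line)
        if m:
            summary["usage_kb"] = int(m.group(1))
            summary["limit_kb"] = int(m.group(2))
            summary["failcnt"] = int(m.group(3))
            continue
        m = KILL_RE.search(line)
        if m:
            summary["killed_pid"] = int(m.group(1))
            summary["killed_name"] = m.group(2)
            summary["anon_rss_kb"] = int(m.group(4))
    return summary


# Two lines copied from the log above.
SAMPLE = [
    "May  6 15:13:20 mw1152 kernel: [4775538.759268] memory: usage 307200kB, limit 307200kB, failcnt 50",
    "May  6 15:13:20 mw1152 kernel: [4775538.768545] Killed process 15595 (convert) total-vm:374972kB, anon-rss:307004kB, file-rss:2652kB",
]

print(summarize_memcg_oom(SAMPLE))
```

On the sample, usage equals the limit (307200 kB), which is consistent with convert being killed for hitting the cgroup cap rather than the host running out of memory.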

Also, I tested that at smaller resulting sizes this doesn't cause the 500 to be emitted.
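
The size comparison could be repeated with something like the sketch below. It assumes the thumb_handler.php path layout seen in the curl call above; `thumb_path` and `probe` are hypothetical helper names, and the host and widths passed in are up to whoever runs it.

```python
import http.client


def thumb_path(hash_dir, name, width):
    """Build a thumb_handler.php path in the shape used by the curl call above."""
    return f"/w/thumb_handler.php/{hash_dir}/{name}/{width}px-{name}"


def probe(host, path, vhost="commons.wikimedia.org", timeout=30):
    """Send a HEAD request (like `curl -I`) to one scaler and return the status."""
    conn = http.client.HTTPConnection(host, 80, timeout=timeout)
    try:
        # Passing Host explicitly mirrors `curl -H 'Host: ...'`.
        conn.request("HEAD", path, headers={"Host": vhost})
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling `probe("mw1152", thumb_path("1/17", "HawkridgeBarton_Chittlehampton_NorthDevon.PNG", w))` for a few values of `w` should show at which width the 500 starts appearing.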

This is the mediawiki error log: P612

I'm still unsure this is really the cause of the larger number of 503s I see.

greg moved this task to This week: May 4-8 on the Roadmap workboard.
greg moved this task to This week: May 11-15 on the Roadmap workboard.
greg moved this task to May 18-22 on the Roadmap workboard.

What do y'all think now? :)

We should probably not schedule this until we resolve the issue Giuseppe spotted in T84842#1265055.

Change 213228 had a related patch set uploaded (by Ori.livneh):
mediawiki: Touch /etc/wikimedia-image-scaler on image scalers

https://gerrit.wikimedia.org/r/213228

Change 213228 merged by Ori.livneh:
mediawiki: Touch /etc/wikimedia-image-scaler on image scalers

https://gerrit.wikimedia.org/r/213228

Since the latest test I did after ori found we needed that file was unsuccessful, and I have zero time to work on this at the moment, I'll release this ticket so that somebody else can grab it.

As matanya clearly stated "We've been waiting forever for this".

Joe removed Joe as the assignee of this task. Jun 2 2015, 1:42 PM
Jdforrester-WMF assigned this task to Joe.
Jdforrester-WMF edited projects, added Performance-Team; removed Patch-For-Review.
Jdforrester-WMF subscribed.

Bah, edit conflicts.

For the past several weeks, this task was blocked on the fact that the HHVM renderer (mw1152) would cause 5xx spikes whenever it was pooled. The issue appears to have been a corruption in the cached byte-code for /w/404.php which caused Varnish to be unable to parse 404 responses generated by HHVM on mw1152. These responses were transformed by varnish into 503s. Touching the file fixed it.

The remaining task is to re-image the other scalers.

Krenair renamed this task from Convert Imagescalers to HHVM, Trusty to Convert eqiad imagescalers to HHVM, Trusty. Jul 4 2015, 12:31 AM
Krenair subscribed.

(codfw imagescalers are all HHVM+Trusty already, so we're just left with mw1153-1160)

Change 223331 had a related patch set uploaded (by Giuseppe Lavagetto):
imagescalers: reimage mw1153 with HAT

https://gerrit.wikimedia.org/r/223331

Change 223331 merged by Giuseppe Lavagetto:
imagescalers: reimage mw1153 with HAT

https://gerrit.wikimedia.org/r/223331

Change 224594 had a related patch set uploaded (by Giuseppe Lavagetto):
imagescalers: reimage mw1154, mw1155 to HAT

https://gerrit.wikimedia.org/r/224594

Change 224594 merged by Giuseppe Lavagetto:
imagescalers: reimage mw1154, mw1155 to HAT

https://gerrit.wikimedia.org/r/224594

Change 225285 had a related patch set uploaded (by Giuseppe Lavagetto):
imagescalers: re-image mw115[6-8] to trusty, HHVM

https://gerrit.wikimedia.org/r/225285

Change 225285 merged by Giuseppe Lavagetto:
imagescalers: re-image mw115[6-8] to trusty, HHVM

https://gerrit.wikimedia.org/r/225285

Krenair changed the task status from Stalled to Open. Jul 17 2015, 7:24 AM

FTR, I am going to depool all remaining Zend imagescalers today to test for any outstanding problems with the HHVM ones. If none arise, I'm going to upgrade the last three tomorrow.

In T84842#1466863, @Joe wrote:

FTR, I am going to depool all remaining Zend imagescaler today to test any outstanding problems with those. If none arise, I'm going to upgrade the last three tomorrow.

Great. Sounds like a good plan. Thanks for pushing forward.

New imagescalers fun found today:

A high rate of 5xx errors was occurring today, so I intercepted a few URLs that were returning a 503. For example, I requested

https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Casquette-IMG_0922.jpg/600px-Casquette-IMG_0922.jpg

from my browser and got a 503:

...
If you report this error to the Wikimedia System Administrators, please include the details below.
Request: GET http://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Casquette-IMG_0922.jpg/600px-Casquette-IMG_0922.jpg, from 10.64.32.81 via cp1048 cp1048 ([10.64.32.100]:3128), Varnish XID 3696382267
Forwarded for: 79.58.168.196, 10.64.32.81, 10.64.32.81
Error: 503, Service Unavailable at Wed, 22 Jul 2015 16:21:18 GMT

while if I do a plain curl I get the correct behaviour, a 302:

curl -I https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Casquette-IMG_0922.jpg/600px-Casquette-IMG_0922.jpg
HTTP/1.1 302 Found
Server: nginx/1.9.3
Date: Wed, 22 Jul 2015 16:30:29 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
X-Content-Type-Options: nosniff
Expires: Thu, 23 Jul 2015 04:22:53 GMT
Vary: X-Forwarded-Proto
Location: https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Casquette_d%27administrateur_colonial_-IMG_0922.jpg/600px-Casquette_d%27administrateur_colonial_-IMG_0922.jpg
Cache-Control: no-cache

So I thought this could have to do with compression. But subsequent requests were successful from the browser as well, so no idea what is going on here. Maybe one imagescaler is doing something wrong?

2015-07-23 00:06:38 mw1153 commonswiki exception ERROR: [d68c4bb1] /w/thumb_handler.php/2/2e/Mirage_III_A_01_Mus%0Aee_du_Bourget_P1020118.JPG/424px-%0AMirage_III_A_01_Musee_du_Bourget_P1020118.JPG   MWException from line 171 of /srv/mediawiki/php-1.26wmf15/includes/objectcache/ObjectCache.php: CACHE_ACCEL requested but no suitable object cache is present. You may want to install APC. {"exception":"[object] (MWException(code: 0): CACHE_ACCEL requested but no suitable object cache is present. You may want to install APC. at /srv/mediawiki/php-1.26wmf15/includes/objectcache/ObjectCache.php:171)"}

Seen from mw115[2-8]

Ummm, why/how do we have HHVM boxes without APC enabled?

See T106743: Exceptions due to APC missing on some scalers due to Gadgets caching

Change 226658 had a related patch set uploaded (by Ori.livneh):
Add ProxyPass rule for thumb_handler.php

https://gerrit.wikimedia.org/r/226658

Change 226738 had a related patch set uploaded (by Ori.livneh):
Follow-up for Ie17cb06: add thumb_handler.php ProxyPass rule to all vhosts

https://gerrit.wikimedia.org/r/226738

Change 226738 merged by Ori.livneh:
Follow-up for Ie17cb06: add thumb_handler.php ProxyPass rule to all vhosts

https://gerrit.wikimedia.org/r/226738

Change 227160 had a related patch set uploaded (by Ori.livneh):
Re-introduce ProxyPass rule for thumb_handler.php, with corrected docroots

https://gerrit.wikimedia.org/r/227160

Change 227160 merged by Ori.livneh:
Re-introduce ProxyPass rule for thumb_handler.php, with corrected docroots

https://gerrit.wikimedia.org/r/227160

I think I found what is the latest gotcha with imagescalers:

curl -H 'X-Forwarded-Proto: https' -H 'Host: commons.wikimedia.org' 'http://mw1152/w/thumb_handler.php/7/75/John_Bauer-Tyr_and_Fenrir.jpg/180px-John_Bauer-Tyr_and_Fenrir.jpg' -v
* About to connect() to mw1152 port 80 (#0)
*   Trying 10.64.16.132... connected
> GET /w/thumb_handler.php/7/75/John_Bauer-Tyr_and_Fenrir.jpg/180px-John_Bauer-Tyr_and_Fenrir.jpg HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Accept: */*
> X-Forwarded-Proto: https
> Host: commons.wikimedia.org
> 
< HTTP/1.1 302 Found
< Date: Wed, 29 Jul 2015 09:56:29 GMT
< Server: Apache
< X-Powered-By: HHVM/3.6.1
< X-Content-Type-Options: nosniff
< Cache-control: no-cache
< Expires: Wed, 29 Jul 2015 21:56:29 GMT
< Vary: X-Forwarded-Proto
< Location: https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Tyr_and_Fenrir-John_Bauer.jpg/180px-Tyr_and_Fenrir-John_Bauer.jpg
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8

while on non-HHVM ones

curl -H 'X-Forwarded-Proto: https' -H 'Host: commons.wikimedia.org' 'http://mw1159/w/thumb_handler.php/7/75/John_Bauer-Tyr_and_Fenrir.jpg/180px-John_Bauer-Tyr_and_Fenrir.jpg' -v
* About to connect() to mw1159 port 80 (#0)
*   Trying 10.64.16.139... connected
> GET /w/thumb_handler.php/7/75/John_Bauer-Tyr_and_Fenrir.jpg/180px-John_Bauer-Tyr_and_Fenrir.jpg HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Accept: */*
> X-Forwarded-Proto: https
> Host: commons.wikimedia.org
> 
< HTTP/1.1 302 Found
< Date: Wed, 29 Jul 2015 09:58:24 GMT
< Server: Apache
< X-Content-Type-Options: nosniff
< Cache-control: no-cache
< Location: https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Tyr_and_Fenrir-John_Bauer.jpg/180px-Tyr_and_Fenrir-John_Bauer.jpg
< Expires: Wed, 29 Jul 2015 21:58:24 GMT
< Vary: X-Forwarded-Proto
< Content-Length: 0
< Content-Type: text/html
<

In this case we have the same problem we had with 404s: thumb.php doesn't set Content-Length, so Apache falls back to chunked transfer encoding, and a chunked response with a zero-length body makes Varnish barf.
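
To spot other endpoints with the same problem, a check along these lines might help. This is a sketch, not what was deployed: `lacks_explicit_length` and `fetch` are hypothetical names, and the two header sets are trimmed copies of the curl transcripts above.

```python
import http.client

REDIRECTS = (301, 302, 303, 307, 308)


def lacks_explicit_length(status, headers):
    """True if a redirect is chunked or carries no Content-Length header."""
    lowered = {k.lower(): v for k, v in headers.items()}
    chunked = "chunked" in lowered.get("transfer-encoding", "").lower()
    return status in REDIRECTS and (chunked or "content-length" not in lowered)


def fetch(host, path, vhost="commons.wikimedia.org"):
    """GET a path from one backend and return (status, headers). Not called here."""
    conn = http.client.HTTPConnection(host, 80, timeout=30)
    try:
        conn.request("GET", path, headers={"Host": vhost, "X-Forwarded-Proto": "https"})
        resp = conn.getresponse()
        return resp.status, dict(resp.getheaders())
    finally:
        conn.close()


# Header sets taken from the HHVM (mw1152) and Zend (mw1159) responses above.
HHVM_HEADERS = {"Transfer-Encoding": "chunked", "Content-Type": "text/html; charset=utf-8"}
ZEND_HEADERS = {"Content-Length": "0", "Content-Type": "text/html"}

print(lacks_explicit_length(302, HHVM_HEADERS))  # True
print(lacks_explicit_length(302, ZEND_HEADERS))  # False
```

Feeding the result of `fetch(...)` for each scaler into `lacks_explicit_length` would flag the hosts whose redirects still go out chunked.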

Change 227676 had a related patch set uploaded (by Giuseppe Lavagetto):
Add Content-Length header to thumb.php redirects

https://gerrit.wikimedia.org/r/227676

Change 227732 had a related patch set uploaded (by BryanDavis):
Add Content-Length header to thumb.php redirects

https://gerrit.wikimedia.org/r/227732

Change 227733 had a related patch set uploaded (by BryanDavis):
Add Content-Length header to thumb.php redirects

https://gerrit.wikimedia.org/r/227733

Change 227732 merged by jenkins-bot:
Add Content-Length header to thumb.php redirects

https://gerrit.wikimedia.org/r/227732

Change 227733 merged by jenkins-bot:
Add Content-Length header to thumb.php redirects

https://gerrit.wikimedia.org/r/227733

Change 227676 merged by jenkins-bot:
Add Content-Length header to thumb.php redirects

https://gerrit.wikimedia.org/r/227676

Change 227967 had a related patch set uploaded (by Giuseppe Lavagetto):
imagescalers: convert the last two servers to HAT

https://gerrit.wikimedia.org/r/227967

Change 227967 merged by Giuseppe Lavagetto:
imagescalers: convert the last two servers to HAT

https://gerrit.wikimedia.org/r/227967

Just for cross-reference, a regression was found with rsvg: SVGs larger than 10 MB (there are about 9000 such files) don't render. We'd need to upgrade to a later version of rsvg to fix this; see T111815.