[00:00:29] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [00:13:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [00:21:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 241 seconds [00:24:58] binasher: filed https://bugzilla.wikimedia.org/show_bug.cgi?id=41090 for the SVG regeneration issue [00:25:48] thanks! [00:26:27] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [00:46:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:58:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.951 seconds [01:33:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [01:47:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.302 seconds [01:51:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [02:21:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:35] !log LocalisationUpdate completed (1.21wmf1) at Wed Oct 17 02:26:34 UTC 2012 [02:26:51] Logged the message, Master [02:37:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [02:46:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [02:49:54] !log LocalisationUpdate completed (1.21wmf2) at Wed Oct 17 02:49:53 UTC 2012 [02:50:08] Logged the message, Master [03:11:54] RECOVERY - Puppet freshness on srv220 is OK: puppet ran at Wed Oct 17 03:11:30 UTC 2012 [03:20:35] RECOVERY - Puppet freshness on srv297 is OK: puppet ran at Wed Oct 17 03:20:24 UTC 2012 [03:33:20] RECOVERY - Puppet 
freshness on sq42 is OK: puppet ran at Wed Oct 17 03:33:04 UTC 2012 [04:15:10] New patchset: Tim Starling; "Periodically restart leaky Swift proxies so that I don't have to" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28325 [04:16:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28325 [04:18:04] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28325 [04:19:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:19:32] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:22:02] New patchset: Tim Starling; "Suppress cron output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28326 [04:23:00] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28326 [04:30:15] j^: 16 15:15:12 < notpeter> jeremyb: paravoid yes, I'm waiting on a porting of a patch to librsvg. other than that, the test box was performing just fine. [04:54:30] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [05:20:08] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: HTTP CRITICAL - No data received from host [05:20:42] TimStarling: ^ [05:20:46] ;) [05:45:18] do we even serve stuff from over there yet? I thought not [05:45:50] apergos: idk, just saw there was recent restartings [05:47:15] it's been in a "warning" state since very soon after that "critical" report [05:47:32] I guess critical -> warning transitions don't go to IRC [05:48:23] the only thing happening in the logs over there is getting the monitoring test file [05:48:52] yeah I guess they don't [06:04:59] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:06:30] so, anyone checking out the search page?
[06:06:35] yes [06:06:51] cool [06:06:58] good morning apergos :) [06:07:06] morning [06:07:26] need me to do anything ? if not, i'll go back to sleep [06:07:49] no. go sleep [06:08:02] thanks for asking though [06:08:08] thanks :) i guess i'll see you again when you go to sleep [06:08:22] have a good (and hopefully no more paging) morning [06:08:22] thank you [06:08:26] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [06:09:22] !log restarted lucene search on search1015 [06:09:32] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:09:34] Logged the message, Master [06:09:47] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [06:19:33] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:41:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.151 seconds [07:07:17] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [07:13:20] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [07:18:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.960 seconds [07:37:17] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [07:59:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.000 second response time on port 11000 [08:05:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:20:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.406 seconds [08:34:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 
10 hours [08:55:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:07:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.614 seconds [09:12:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [09:43:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:44:33] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28210 [09:56:12] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:56:12] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:59:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [10:01:17] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [10:04:33] New patchset: Hashar; "remove executable flags on noc files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28334 [10:06:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [10:30:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:48:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [11:19:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:59] New patchset: Hashar; "make labs TMH setup more like production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [11:26:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [11:31:49] New review: Hashar; "deployed on beta hopefully" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [11:33:28] New patchset: Mark Bergsma; 
"First attempt to support passed (high) range requests on Varnish backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28213 [11:33:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.068 seconds [11:34:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28213 [12:01:57] New patchset: Hashar; "extension assets not available on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28337 [12:02:21] New review: Hashar; "Needed by beta labs." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/28337 [12:02:25] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28337 [12:08:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.523 seconds [12:26:10] New patchset: J; "role::applicationserver::common depends on upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28341 [12:27:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28341 [12:34:10] New patchset: Hashar; "beta: blacklist socks proxy listed in sorbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28342 [12:34:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28342 [12:44:29] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:50] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.315 second response time [12:47:11] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:50:37] New review: Demon; "On further thought, this won't quite work on production yet. 
Need to sort out some way to deal with ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/25508 [12:56:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:02] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.949 second response time [13:06:59] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:29] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.243 second response time [13:11:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.436 seconds [13:11:03] otsr-wiki has been loading very slowly for a while [13:11:29] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:51] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:12:59] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:41] taking ages indeed [13:14:20] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:37] 19 s here [13:17:20] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.062 second response time [13:17:39] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.423 second response time [13:20:47] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:29] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:29] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:47] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:23:47] RECOVERY - Apache HTTP on mw21 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.747 second response time [13:23:56] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:23] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:29] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.064 second response time [13:25:35] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:35] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:44] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.165 second response time [13:25:56] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:05] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.503 second response time [13:27:05] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.412 second response time [13:27:05] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:05] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.359 second response time [13:27:23] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:30:05] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.644 second response time [13:30:59] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:44] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:20] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [13:33:15] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.082 second response time [13:33:32] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [13:33:42] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:53] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:53] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.557 second response time [13:34:53] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46872 bytes in 6.581 seconds [13:36:26] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.545 second response time [13:36:41] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.620 second response time [13:37:20] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:20] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:38] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.377 second response time [13:38:38] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.736 second response time [13:40:07] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:20] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.380 second response time [13:42:06] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:50] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:50] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:59] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:59] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:11] RECOVERY 
- Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [13:44:20] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.009 second response time [13:44:20] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.152 second response time [13:44:31] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.954 second response time [13:45:05] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.455 second response time [13:45:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:41] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:26] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:44] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:20] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:20] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:47] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.095 second response time [13:48:41] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [13:48:41] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [13:48:41] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.912 second response time [13:49:44] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.556 second response time [13:50:20] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:05] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:26] RECOVERY - Apache HTTP on 
mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.280 second response time [13:53:21] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [13:57:05] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:17] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:17] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:26] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.302 second response time [13:58:41] New patchset: Matthias Mullie; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [13:59:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.613 seconds [14:00:32] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:17] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [14:01:17] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [14:01:17] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:58] mark: ping [14:02:02] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 46879 bytes in 0.453 seconds [14:02:07] something's wrong [14:02:26] New patchset: Demon; "Gerrit hook cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28351 [14:02:28] and paging's probably broken -- I didn't get anything [14:02:42] hmm I don't either [14:02:44] latency is through the roof, so are 500s [14:02:48] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39606 bytes in 3.721 seconds [14:02:48] but maybe it's not 
"critical" [14:03:03] a lot of parsercache misses [14:03:10] apergos: LVS alerts always are afaik [14:03:23] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28351 [14:03:32] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:59] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:44] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.893 second response time [14:05:02] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.101 second response time [14:05:06] not for https [14:06:06] that's not https [14:06:15] not all of them at least [14:06:17] indeed [14:06:19] the problem is with appservers [14:06:24] their load has spiked [14:06:38] graphite shows increased latency and 500s, probably because of the contention [14:06:59] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.142 second response time [14:07:11] a lot of BannerListLoader/BannerLoader hits [14:07:13] New review: Hashar; "Gallium has puppet client 2.7.6 which lack the puppet/util/color class :-( Will hopefully get fixed..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28351 [14:07:25] 10% of hits that is [14:07:47] where did you see that? 
[14:08:07] sampled-1000 [14:08:10] and grep [14:08:11] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:12] on emery [14:08:21] poor man's log analysis [14:08:22] New patchset: Demon; "Suppress IRC notification on patchset-created for drafts (DO NOT MERGE UNTIL 2.5)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28352 [14:08:59] higher external store get rate according to graphite [14:09:08] by a factor 2-3 [14:09:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28352 [14:09:33] indeed [14:09:45] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:09:47] New review: Demon; "jenkinsbot is ready to go (see comment here from it). Have it disabled until this is merged so we do..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/28351 [14:10:40] bannerloader/listloader are mostly cache hits [14:11:02] yeah that's probably mwalker's purge script [14:11:02] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.433 second response time [14:11:09] misses are a 0.6% of hits [14:11:11] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.530 second response time [14:11:22] er? [14:11:25] i notice that tim has done something on pc1 [14:11:29] parser cache [14:12:04] New review: Hashar; "disregard my comment. Chad has set up the job to use a shell instead of the rake script :-]" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28351 [14:13:15] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28221 [14:13:44] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:47] seems like mysql is not listening on its port on pc1? 
[14:15:08] ah it's doing recovery [14:15:14] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.069 second response time [14:15:30] huh [14:15:44] 121017 13:54:59 InnoDB: Database was not shut down normally! [14:15:44] InnoDB: Starting crash recovery. [14:15:46] it's at 85% [14:15:58] 88%.. [14:16:33] 90 [14:16:33] I think we should just wait [14:17:57] whoops [14:17:59] it has done recovery before, and then crashed [14:18:30] many times [14:19:08] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:26] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:44] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:11] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:20:11] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:20:38] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:38] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:38] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.976 second response time [14:21:14] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [14:21:59] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.255 second response time [14:22:09] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:36] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [14:23:32] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:38] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.089 second response time [14:23:47] RECOVERY - Apache HTTP on mw43 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.548 second response time [14:23:47] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:59] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.192 second response time [14:26:05] New patchset: jan; "Add "role::mediawiki-update::labs" for updating labs-MW-installations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [14:26:56] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.894 second response time [14:27:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28355 [14:29:11] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:23] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:34] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.713 second response time [14:30:41] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:53] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.997 second response time [14:32:20] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:23] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:41] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:50] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [14:33:50] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.178 second response time [14:34:53] !log midom synchronized wmf-config/InitialiseSettings.php [14:35:05] Logged the message, Master 
[14:35:20] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.585 second response time [14:36:32] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.671 second response time [14:36:41] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:38:04] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [14:40:17] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:26] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:11] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:38] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [14:41:48] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.098 second response time [14:42:44] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.419 second response time [14:44:56] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.972 seconds [14:46:28] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.592 second response time [14:46:35] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:44] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:29] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:56] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.103 second response time [14:48:14] RECOVERY - Apache HTTP on mw39 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.142 second response time [14:49:17] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.225 second response time [14:49:35] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:56] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [14:51:32] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:08] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.712 second response time [14:53:02] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.439 second response time [14:55:17] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [14:55:17] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:48] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.745 second response time [14:57:50] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:28] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.457 second response time [15:02:20] New patchset: Mark Bergsma; "First attempt to add Range support to Varnish in streaming mode" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28361 [15:07:17] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:38] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.569 second response time [15:09:34] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:34] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:02] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time 
[15:12:33] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.936 second response time [15:17:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:50] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [15:20:03] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:23] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.235 second response time [15:22:26] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:47] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.519 second response time [15:23:57] !log reedy synchronized php-1.21wmf1/extensions/EducationProgram/ [15:24:06] Logged the message, Master [15:25:53] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:53] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:23] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.374 second response time [15:28:26] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:53] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [15:29:56] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62554 bytes in 0.807 seconds [15:31:46] !log reedy synchronized php-1.21wmf2/extensions/EducationProgram [15:31:58] Logged the message, Master [15:35:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [15:35:56] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [15:36:03] !log switched donate.wikimedia.org to geodns [15:36:13] Logged the message, Master [15:37:27] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.393 second response time [15:39:32] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:02] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.716 second response time [15:45:41] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:01] !log reedy Started syncing Wikimedia installation... : Rebuilding localisation cache for educationprogram [15:47:09] Logged the message, Master [15:48:52] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.205 second response time [15:50:20] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 80% free (6060 MB out of 7628 MB) [15:53:38] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:17] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 72% free (5468 MB out of 7628 MB) [15:58:08] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.258 second response time [15:59:02] !log increasing weight on mw17-mw59 by 33.3% [15:59:13] Logged the message, notpeter [16:00:14] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 64% free (4854 MB out of 7628 MB) [16:00:28] Ganglia seems upset... bits esams isn't showing at all [16:00:53] one of the eqiad esams boxes is marked as down.. 
[16:01:06] and the lvs graphs have seemingly just stopped [16:01:32] !log reedy synchronized php-1.21wmf1/cache/l10n/ [16:01:42] Logged the message, Master [16:05:38] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 56% free (4254 MB out of 7628 MB) [16:08:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:17] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 48% free (3646 MB out of 7628 MB) [16:12:35] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:35] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:03] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:03] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:02] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46870 bytes in 0.627 seconds [16:14:02] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.851 second response time [16:15:11] !log reedy synchronized php-1.21wmf2/cache/l10n/ [16:15:14] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 40% free (3047 MB out of 7628 MB) [16:15:21] Logged the message, Master [16:15:59] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [16:16:09] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.446 second response time [16:18:42] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:59] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:17] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:19:17] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:26] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:44] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:44] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:58] ok, who killed cp1042 [16:20:11] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:20:20] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 33% free (2444 MB out of 7628 MB) [16:20:20] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.013 second response time [16:20:38] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.688 second response time [16:21:03] !log restarting varnish on cp1042 , gone into swapdeath spiral [16:21:16] Logged the message, Mistress of the network gear. 
[16:21:59] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:55] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:11] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:29] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.418 second response time [16:23:29] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:29] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.358 second response time [16:23:38] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:38] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:14] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [16:24:23] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [16:24:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [16:25:01] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:01] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.160 second response time [16:25:01] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.302 second response time [16:25:01] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.627 second response time [16:25:08] PROBLEM - Apache HTTP on srv209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:17] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 25% free (1840 MB out of 7628 MB) [16:25:35] RECOVERY - LVS HTTPS 
IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39615 bytes in 6.013 seconds [16:26:39] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.872 second response time [16:27:14] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:23] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:35] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:50] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.768 second response time [16:28:17] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:26] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:53] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [16:28:53] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.793 second response time [16:29:47] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [16:29:47] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.492 second response time [16:30:05] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [16:30:14] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 17% free (1256 MB out of 7628 MB) [16:30:14] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [16:30:14] RECOVERY - Varnish HTCP daemon on cp1042 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [16:30:23] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:51] RECOVERY - Varnish HTTP mobile-backend on cp1042 is OK: HTTP OK HTTP/1.1 200 OK - 698 bytes in 0.166 seconds 
[16:31:35] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:05] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.178 second response time [16:33:42] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.961 second response time [16:33:42] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:39] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:39] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:39] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.712 second response time [16:35:11] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:21] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 9% free (658 MB out of 7628 MB) [16:36:05] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.729 second response time [16:36:05] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.046 second response time [16:36:44] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.886 second response time [16:37:55] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:55] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:02] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:02] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:23] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.475 second response time [16:38:29] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:23] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 
2.093 second response time [16:39:23] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.394 second response time [16:40:01] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.655 second response time [16:40:17] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 1% free (67 MB out of 7628 MB) [16:40:17] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:12] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.981 second response time [16:41:12] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:40] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:47] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 46879 bytes in 0.152 seconds [16:42:32] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:32] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.322 second response time [16:42:32] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.531 second response time [16:43:53] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.413 second response time [16:44:11] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:31] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:31] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:38] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.555 second response time [16:45:07] New patchset: Mark Bergsma; "First attempt to add Range support to Varnish in streaming mode" [operations/debs/varnish] (testing/3.0.3plus-rc1) - 
https://gerrit.wikimedia.org/r/28372 [16:45:14] RECOVERY - check_swap on indium is OK: SWAP OK - 100% free (7617 MB out of 7628 MB) [16:45:42] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:45:59] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:45:59] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.770 second response time [16:47:56] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:21] * AaronSchulz wonders what LeslieCarr is talking about [16:48:40] AaronSchulz: i don't think you want to know [16:48:41] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.341 second response time [16:49:18] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39615 bytes in 0.902 seconds [16:50:00] !log midom synchronized wmf-config/InitialiseSettings.php [16:50:09] domas: what did you disable? 
[16:50:14] Logged the message, Master [16:50:32] enabled [16:50:47] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:47] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:01] * AaronSchulz does a whois to confirm that this is the real domas [16:51:19] ;-p [16:51:38] maybe you enable a thing that disables other things [16:51:59] true that [16:51:59] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.430 second response time [16:52:08] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [16:52:14] servers flapping could be fixed [16:52:21] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.852 second response time [16:52:21] but I guess that would need some work [16:52:26] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:00] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.006 second response time [16:54:05] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:26] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:38] New patchset: Kaldari; "Turning on Echo for en.wiki on labs beta cluster." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28374 [16:56:47] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [16:56:47] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:05] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.824 second response time [16:57:19] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28374 [16:57:23] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:17] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.597 second response time [16:58:53] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.471 second response time [17:00:26] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:27] New patchset: CSteipp; "Initial WikiVoyage config for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28375 [17:01:44] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.067 second response time [17:02:23] New patchset: Hashar; "beta: honor default env" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28377 [17:02:29] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28377 [17:04:44] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:44] New review: Andrew Bogott; "Is this class a duplicate of the mediawiki-install class with 'latest' instead of 'present'? If so,..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28355 [17:04:56] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:47] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:52] New patchset: Hashar; "beta: honor default env" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28377 [17:06:14] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [17:06:14] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [17:06:23] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:51] New review: Hashar; "Patchset 2, actually pass the VERBOSE env variable :-]" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28377 [17:06:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28377 [17:06:59] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:59] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.409 second response time [17:07:20] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.373 second response time [17:07:48] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.617 second response time [17:08:11] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:11] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [17:08:38] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.936 second response time [17:09:23] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:09:44] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [17:09:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.479 seconds [17:10:44] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.992 second response time [17:12:05] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:34] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:41] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.667 second response time [17:12:41] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.634 second response time [17:12:50] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:02] New patchset: Mark Bergsma; "First attempt to add Range support to Varnish in streaming mode" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28379 [17:13:45] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28361 [17:13:53] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:59] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28372 [17:14:02] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.776 second response time [17:14:14] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [17:14:20] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:20] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.224 second response time [17:15:05] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.468 second response time [17:15:26] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [17:15:51] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.568 second response time [17:16:53] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:17:02] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.693 second response time [17:20:47] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:05] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:14] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:12] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:27] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:27] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:35] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.254 second response time [17:22:44] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.632 second response time [17:23:47] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:23:47] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [17:23:47] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.096 second response time [17:24:51] !log Installed experimental Varnish package with range support in streaming mode on cp1021 [17:25:01] Logged the message, Master [17:26:25] mark .. 
awesome :-) [17:27:00] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.958 second response time [17:28:26] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:47] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.268 second response time [17:30:23] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:08] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.626 second response time [17:31:35] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:47] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:05] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [17:33:32] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.941 second response time [17:33:59] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:17] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39423 bytes in 8.429 seconds [17:34:44] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:29] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.516 second response time [17:36:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47063 bytes in 0.686 seconds [17:36:14] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:45] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.202 second response time [17:38:13] PROBLEM - Apache HTTP on 
mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:32] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.255 second response time [17:41:11] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:38] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:08] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.730 second response time [17:44:02] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:03] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:11] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:35] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.453 second response time [17:45:50] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.280 second response time [17:45:50] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.802 second response time [17:45:50] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:50] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.518 second response time [17:46:18] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:35] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:35] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:45] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:13] RECOVERY - Apache HTTP on mw19 is OK: 
HTTP OK - HTTP/1.1 301 Moved Permanently - 9.148 second response time [17:47:20] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.834 second response time [17:47:29] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:29] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:33] i'm seeing a lot of 503s on en.m.wikipedia.org [17:47:47] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.791 second response time [17:47:48] chrismcmahon mentioned that this was reported a while ago, but is this something that is actively being worked on? [17:47:56] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:48:05] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.294 second response time [17:48:05] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.393 second response time [17:48:52] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.532 second response time [17:48:52] preilly know anything about the 503s? ^ [17:49:08] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.824 second response time [17:49:48] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:10] LeslieCarr, mutante know anything about the 503s we're seeing on en.m.wikipedia.org? 
[17:50:21] RobH, mark ^ [17:51:16] awjr: nope, i can poke around [17:51:36] LeslieCarr: thanks, i've heard they've been happening for at least a couple of hours but i can't tell if anyone's looking into it [17:52:00] there was cp1042 which had varnish freak out but that was a bit ago [17:52:38] awjr: looks like more varnish swapdeath [17:52:44] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 46877 bytes in 0.202 seconds [17:52:52] usually root cause of that is when they have had some momentary lack of connectivity to their backends [17:53:09] !log restarting varnish on cp1041 [17:53:21] Logged the message, Mistress of the network gear. [17:53:35] heh [17:53:48] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:03] LeslieCarr: are there no alarms for when hella 503s happen? [17:54:18] not unless watchmouse has it i believe [17:54:32] we don't really have a monitoring system that's very close -- if you find one, we would love to be able to evaluate it [17:55:15] awjr: how does it look now ? [17:55:15] i can dig it [17:55:18] lemme check [17:55:35] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:35] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:38] still getting 503s [17:55:44] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:03] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:12] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:29] LeslieCarr: does having the XID from the errors help? 
[17:56:47] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.224 second response time [17:56:56] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:56] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.317 second response time [17:57:14] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.214 seconds [17:57:23] it does not really help me as i'm not the best at debugging these things when it's not super obvious [17:57:23] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [17:57:32] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.088 second response time [17:57:41] PROBLEM - Varnish HTTP mobile-frontend on cp1041 is CRITICAL: Connection refused [17:57:59] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:05] well there we go ... 
[17:58:26] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:44] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.803 second response time [17:58:48] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:48] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:53] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.069 second response time [17:59:21] RECOVERY - Varnish HTTP mobile-frontend on cp1041 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [17:59:59] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [17:59:59] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.602 second response time [18:00:04] awjr: now ? [18:00:13] !log restarted varnish-frontend on cp1041 [18:00:14] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.271 second response time [18:00:16] LeslieCarr: looking good so far [18:00:23] Logged the message, Mistress of the network gear. [18:00:32] oh noes [18:00:36] nooo just got one [18:00:42] i just got another 503 (clicking through manually) [18:00:44] yeah :( [18:01:03] can someone who knows varnish better help please ? 
[18:01:11] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.717 second response time [18:01:28] i mean, my next step is just restarting the boxes ;) don't think that's the right thing to do though [18:01:44] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.877 second response time [18:01:53] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:11] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:36] well there's some processes in swapdeath on cp1041 [18:02:42] i'm going to try killing them all and restarting them [18:03:18] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:18] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.776 second response time [18:03:33] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.317 second response time [18:03:50] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:04:02] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:04:30] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:02] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:02] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:20] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.279 second response time [18:05:20] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.074 second response time [18:05:29] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:32] !log killed and restarted all varnish processes on cp1041 [18:05:44] Logged the message, Mistress of the network gear. 
[18:05:47] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.328 second response time [18:06:05] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:41] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.080 second response time [18:06:41] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.122 second response time [18:06:59] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.624 second response time [18:07:36] LeslieCarr: i seem to be consistently getting 503s now [18:07:44] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.730 second response time [18:07:58] LeslieCarr: nm now im getting actual pages [18:08:06] !log reedy synchronized php-1.21wmf2/includes/api/ApiEditPage.php [18:08:11] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.259 second response time [18:08:19] Logged the message, Master [18:08:25] with the occasional 503 still [18:09:05] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:05] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:05] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.029 second response time [18:09:22] i have no idea [18:09:46] htop shows a bunch of varnish workers "waiting for net" [18:10:35] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.610 second response time [18:10:35] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.094 second response time [18:10:35] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:45] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:57] dang [18:11:29] PROBLEM - Apache HTTP on mw59 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [18:11:55] and a bunch of processes are trying to take up 201G of memory [18:11:56] ! [18:12:05] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [18:12:09] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.005 second response time [18:12:10] awjr: 503 from varnish [18:12:17] LeslieCarr: can we just take cp1041 out of the pool? [18:12:26] or are the other varnish boxes seeing the same thing? [18:12:38] New review: jan; "The class "role::mediawiki-update::labs" adds beside the "latest" to git::clone the exec for update...." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/28355 [18:12:41] cp1042 is [18:12:41] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:54] and cp1043 [18:12:59] :( [18:13:01] and cp1044 [18:13:02] what has changed? [18:13:07] so yeah all of them [18:13:07] no bueno [18:13:07] not sure [18:13:10] anyone know where asher is ? 
[18:13:36] i wonder when the problem actually started; we did a deployment yesterday afternoon [18:13:36] * AaronSchulz doesn't [18:13:54] but didnt see any issues immediately after [18:14:32] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [18:14:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:14:34] paravoid: ping [18:14:56] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:15:43] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.058 second response time [18:15:50] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [18:15:50] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:26] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.231 second response time [18:17:02] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:11] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:22] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.697 second response time [18:17:50] hi [18:17:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Closed, special and private to 1.21wmf2 [18:18:12] Logged the message, Master [18:18:12] LeslieCarr: swapdeath on cp1041? [18:18:12] LeslieCarr: you know about the parser cache issue from earlier, right? [18:18:14] what makes you think that? [18:18:29] I just got a redirect loop error for officewiki again... [18:19:03] LeslieCarr: i don't think there is any more, it seemed like there was before, sluggish, uptime showed load at like 1000, top showed swap usage super high [18:19:16] haha i meant to point that last one at you mark :) [18:19:25] are you talking about cp1041? [18:19:30] robla: somewhat ? 
[18:19:35] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:02] before i restarted varnish , now it just seems like on all the machines htop shows the processes trying to use impossibly high memory [18:20:03] basically, the Apaches are now getting hit harder than normal because the disk-backed parser cache got emptied [18:20:13] ah [18:20:18] you didn't restart any others yet, right (please don't)? [18:20:28] lemme look at SAL [18:20:41] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:47] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:49] LeslieCarr: can you tell me exactly what you did? [18:20:56] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:58] just cp1041 at 1805 UTC [18:21:09] oh cp1042 at 16:21 utc [18:21:09] and you're saying that the others are in swapdeath as well? [18:21:19] please log that [18:21:24] doesn't seem like swapdeath as i know it [18:21:27] i had logged those [18:22:08] i tried "service varnish stop", "service varnishhtcpd stop" and those both failed to kill all the processes [18:22:17] then i did a kill -9 on the PID's of varnish that were still showing up [18:22:17] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.303 second response time [18:22:36] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.833 second response time [18:22:51] then started varnish and varnishhtcpd and varnish-frontend using their respective service commands [18:22:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: metawiki back to 1.21wmf1 [18:23:05] Logged the message, Master [18:23:11] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [18:23:20] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second 
response time [18:23:38] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [18:24:05] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:23] Reedy: let's hold off on deploying further until ops gives the all clear [18:25:08] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:17] don't want to introduce more randomness into the situation yet, especially with varnish front-ends going down [18:25:22] but ganglia shows no signs of high load or memory at all? [18:25:35] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.587 second response time [18:25:53] mark we are continuing to see sporadic 503s on the mobile site [18:26:00] I believe you [18:26:05] mark: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1350497771&g=load_report&z=large&c=Application%20servers%20pmtpa [18:26:15] LeslieCarr: you were not looking at the 'VIRT' (virtual memory) in top right? 
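(Aside on the VIRT question just above: top and htop report both virtual size and resident size, and only the resident figure reflects memory actually held in RAM. A minimal sketch of reading both from the text of a Linux /proc/<pid>/status file follows. The sample text and its numbers are made up for illustration, chosen so that the virtual size matches the 201G figure mentioned earlier; they are not measurements from cp1041.)

```python
import re


def parse_proc_status(text):
    """Return (VmSize, VmRSS) in kB parsed from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields.get("VmSize"), fields.get("VmRSS")


def describe(text):
    """Format virtual and resident size in GiB, the way top labels VIRT and RES."""
    virt_kb, res_kb = parse_proc_status(text)
    return "VIRT {:.1f} GiB, RES {:.1f} GiB".format(
        virt_kb / 1024 ** 2, res_kb / 1024 ** 2
    )


# Hypothetical sample; a real file has many more fields.
SAMPLE_STATUS = "Name:\tvarnishd\nVmSize:\t210763776 kB\nVmRSS:\t  8912896 kB\n"

print(describe(SAMPLE_STATUS))
```

(A large VIRT next to a modest RES is what a process that maps big files or reserves address space would be expected to show, so VIRT alone is a weak signal of memory pressure.)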
[18:26:29] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:38] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.836 second response time [18:26:43] you can see a load spike in ganglia though it's not as high as reported by uptime and the memory usage doesn't appear to jive with actual usage [18:26:44] doh :-/ [18:26:46] yeah virt [18:26:57] actual reserved is much more reasonable at about 8.5 g [18:27:04] no, resident [18:27:11] ok [18:27:15] I don't think it's varnish misbehaving [18:27:19] let's investigate the backend responses [18:27:53] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [18:28:44] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.811 second response time [18:30:05] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:44] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:11] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [18:32:21] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:05] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.469 second response time [18:33:05] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.405 second response time [18:33:32] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:45] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [18:33:45] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.493 
second response time [18:34:00] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:56] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [18:35:11] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:35:29] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39668 bytes in 7.877 seconds [18:35:42] 15 FetchError c no backend connection [18:35:56] preilly: pong [18:35:56] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:21] paravoid: do you know redis at all? [18:36:23] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:35] woosters: I suppose I could have just asked that question in IRC, so you could paste the URL as the answer :) [18:36:44] paravoid: do you have a few moments to look at redis on silver? [18:36:51] * robla asked CT about the memory pool graph [18:37:04] http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?c=MySQL%20pmtpa&h=pc1.pmtpa.wmnet&v=0&m=mysql_innodb_free_space&r=day&z=default&jr=&js=&st=1350498944&vl=Mbytes&ti=mysql_innodb_free_space&z=large [18:37:10] I've played with it a bit in the past, yes [18:37:17] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.156 second response time [18:37:18] what do we use it for and what's the problem? 
[18:37:45] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:13] paravoid: it's used for VUMI [18:38:18] paravoid: and it won't start [18:38:29] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:37] paravoid: notpeter is looking at it right now [18:38:41] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:41] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:37] paravoid: but I thought that you might have some other ideas [18:39:51] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46870 bytes in 0.177 seconds [18:39:59] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:08] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.635 second response time [18:40:47] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [18:41:02] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.874 second response time [18:41:34] paravoid: it's now barfing on its conf [18:41:38] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:47] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.092 second response time [18:41:57] mark: remember the other day when I asked about squid being made to ignore everything after a specific querystring? 
I have another approach to ask about [18:41:58] fixed [18:42:11] The following redis.conf and CONFIG GET / SET parameters changed: [18:42:11] * hash-max-zipmap-entries, now replaced by hash-max-ziplist-entries * hash-max-zipmap-value, now replaced by hash-max-ziplist-value [18:42:15] from "Migrating from 2.4 to 2.6" [18:42:34] # dpkg-query -W redis-server [18:42:35] redis-server 2:2.6.0-rc7-wmf1 [18:43:07] basically mobile is sending 503s because the apaches are too busy, indeed [18:43:21] paravoid: awesome [18:43:36] New patchset: Platonides; "(Bug 41121) As production started using /mnt/upload7, make such folder available in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28391 [18:43:49] reprepro changes: [18:43:49] add precise-wikimedia deb universe amd64 redis-server 2:2.6.0-rc7-wmf1 -- pool/universe/r/redis/redis-server_2.6.0-rc7-wmf1_amd64.deb [18:43:56] Sep 21st [18:44:03] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:11] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:21] !log racking and setting up replacements for ms-be6/7 [18:44:27] no entry in SAL, but these are frequently skipped [18:44:31] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [18:44:33] Logged the message, Master [18:44:38] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28391 [18:44:47] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.752 second response time [18:44:48] I'm going to take a guess and say that Asher installed a new version [18:44:55] for his redis experiments [18:45:05] where? 
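(Aside on the redis.conf breakage quoted above: the migration note names two directives that were renamed between 2.4 and 2.6. A minimal sketch of rewriting an old config text accordingly is below. The function name and the sample values are assumptions for illustration; only the two directive renames come from the quoted note, and this is not what was actually run on silver.)

```python
# Renames taken from the "Migrating from 2.4 to 2.6" note quoted in the log.
RENAMED_DIRECTIVES = {
    "hash-max-zipmap-entries": "hash-max-ziplist-entries",
    "hash-max-zipmap-value": "hash-max-ziplist-value",
}


def migrate_redis_conf(text):
    """Return config text with the renamed directives replaced.

    Comment lines and all other directives pass through unchanged.
    """
    out = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] in RENAMED_DIRECTIVES:
            rest = parts[1] if len(parts) > 1 else ""
            line = (RENAMED_DIRECTIVES[parts[0]] + " " + rest).rstrip()
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical old-style config fragment.
OLD_CONF = (
    "port 6379\n"
    "hash-max-zipmap-entries 512\n"
    "hash-max-zipmap-value 64\n"
    "# hash-max-zipmap-entries is mentioned in this comment only\n"
)

print(migrate_redis_conf(OLD_CONF), end="")
```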
[18:45:09] apt [18:45:23] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.071 second response time [18:45:24] yeah. don't ensure latest on something like mysql or redis [18:45:29] s/installed/uploaded/ [18:45:32] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [18:45:49] "redis-server": [18:45:50] ensure => "latest"; [18:45:59] manifests/mobile.pp [18:46:08] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.012 second response time [18:46:13] ;) [18:46:27] although I prefer systems to be up-to-date, puppet is probably the wrong way to do it [18:46:34] cmjohnson1: the machines are here? [18:46:35] yay [18:47:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.293 seconds [18:47:23] shouldn't ensure latest for anything that will impact production if it goes down [18:47:28] no [18:47:29] agreed [18:47:29] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Wed Oct 17 18:47:16 UTC 2012 [18:47:31] ensure on a specific version [18:47:56] that would make puppet break in this scenario [18:48:04] danke Platonides!!!! [18:48:04] paravoid 2 of them [18:48:06] it was fun when thumbnails went down periodically when puppet had ensure latest on nginx [18:48:20] paravoid: no it wouldn't [18:48:32] so, how are we going to name the R720xd? [18:48:37] are we keeping the same names? [18:48:42] I'd like us to [18:48:46] New review: Andrew Bogott; "I don't much like having list of which components and extensions are installed duplicated... so that..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28355 [18:48:46] would be easiest [18:48:50] need to replace them one by one anyway [18:49:01] might as well go in the same rack spaces, same hostnames, same everything [18:49:07] indeed [18:49:21] but it's better than I initially thought [18:49:31] so we have 4 boxes that are offline now [18:49:45] we can put up 4 new ones immediately into production [18:49:45] and let it replicate [18:50:02] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:02] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:03] maybe remove 1 old one at the same time too [18:50:30] paravoid: we are keeping the same names [18:50:41] then we only have to put up 4 more into the swift cluster, since we're keeping 4 for the evaluation [18:50:45] yes [18:50:56] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [18:51:23] I am going to replace all the borked ones first...once they are working and added to the cluster, we can remove the c2100's that are in production now [18:51:23] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [18:51:31] the working c2100's are the last phase [18:51:42] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:55] are you gonna wipe disks? [18:51:59] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:21] mark: LeslieCarr: are you still doing active futzing with the cluster, or are you just planning to let the current timeouts play out? 
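(Aside, picking up the `ensure => "latest"` thread from a few minutes earlier: one way to catch this before it reaches production might be a small check over manifest text. The sketch below is an assumption about how such a check could look, not an existing tool; the regex is deliberately crude, the comment stripping would mis-handle a `#` inside a quoted string, and the second package in the sample is hypothetical.)

```python
import re

# Matches ensure => latest with or without quotes around the value.
ENSURE_LATEST = re.compile(r"""ensure\s*=>\s*["']?latest\b""")


def find_ensure_latest(manifest_text):
    """Return 1-based line numbers where a resource is set to ensure => latest."""
    hits = []
    for number, line in enumerate(manifest_text.splitlines(), 1):
        code = line.split("#", 1)[0]  # crude: drop trailing comments
        if ENSURE_LATEST.search(code):
            hits.append(number)
    return hits


# Sample text; the first resource mirrors the snippet quoted in the log.
SAMPLE_MANIFEST = '''package { "redis-server":
    ensure => "latest";
}
package { "example-pinned":
    ensure => "1.2.3";   # hypothetical pinned version
}
'''

print(find_ensure_latest(SAMPLE_MANIFEST))
```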
[18:52:21] robla: i can't really do much about it I think [18:52:26] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [18:52:34] robla: i haven't touched anything since mark came online [18:53:04] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [18:53:06] varnish is having issues contacting the apaches, so when it can't get backend objects, there's a whole string of 503s for a while [18:53:07] k....so....the next question is: should we go forward with our regularly scheduled 1.21wmf2 deployment to commons and other non-WP sites? [18:53:11] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.871 second response time [18:53:21] probably not [18:53:32] it's likely it'll load the apaches more [18:53:37] and in any case makes it harder to find issues now [18:53:42] !log olivneh synchronized php-1.21wmf1/extensions/E3Experiments/lib/event/eventlog.js [18:53:51] we can't see if it's caused by the pc emptyness or the new code [18:53:54] Logged the message, Master [18:54:15] what's unfortunate is that the mobile varnish cluster has a pretty bad cache hit rate [18:54:16] mark: all the disk are going to be wiped [18:54:31] * cmjohnson1 assumes mark was talking to me [18:54:35] cmjohnson1: yes [18:54:46] cmjohnson1: I was thinking, perhaps just wiping the system drives (2) per ssystem would be enough [18:54:47] Reedy: see ^ [18:54:56] !log fixed pc1 innodb_free_space logging on pc1 via "GRANT SELECT ON `parsercache`.`pc000` TO 'dbstats'@'localhost';" - (it had only been granted access to the objectcache table which no longer exists) [18:55:03] looks like we're probably going to need to postpone [18:55:04] Logged the message, Master [18:55:10] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.369 second response time [18:55:33] i feel safer with a full wipe....the OS is on the ssd's on most of them 
but rather not take the chance [18:55:33] https://gdash.wikimedia.org/dashboards/pcache/ [18:55:52] We're still gettng 60%+ cache hit rate [18:55:53] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:54] cmjohnson1: ok ;) [18:56:35] Reedy: until the Apaches settle down, though, we're going to have a tough time spotting wmf2 induced problems [18:56:35] New review: jan; "For example when you develop an MW-extension and do want to have a auto update for core so you can t..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/28355 [18:57:23] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:57:31] I think maybe we need to commandeer a spot on tomorrow's deployment calendar [18:58:34] or....we could do it a little earlier in the day tomorrow [18:58:54] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [18:58:54] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:31] mark: just read back scroll; did i interpret right that there's nothing much to be done for the 503s aside waiting for the apaches to chill out? [18:59:50] pretty much [19:00:06] the only reason the main site is faring slightly better is because there's more in cache [19:00:18] roger [19:00:33] Reedy: would 8am-10am PDT tomorrow work for you? [19:00:38] * robla does tz math [19:00:47] paravoid: is redis now working on silver? [19:00:55] preilly: the daemon started [19:01:08] I haven't tried any operations, I presume the rest is okay [19:01:12] mark is there something we can do to prevent this in the future? also - can we add monitoring/alarming for 503s from the mobile site? 
this has happened before but response time to the issue is generally slow due to no one realizing it was even happening [19:01:17] Reedy: 4pm-6pm UK [19:01:26] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:38] paravoid: can you restart supervisord [19:02:02] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.086 second response time [19:02:20] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:51] awjr: I'm pretty sure we can. this is what the problem was: http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?c=MySQL%20pmtpa&h=pc1.pmtpa.wmnet&v=0&m=mysql_innodb_free_space&r=day&z=default&jr=&js=&st=1350498944&vl=Mbytes&ti=mysql_innodb_free_space&z=large [19:03:05] 0 == bad [19:03:11] oh [19:03:23] owch [19:03:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:41] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.681 second response time [19:03:51] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:51] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:51] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.642 second response time [19:03:57] that is a remarkable spike. [19:04:20] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:36] chrismcmahon: it's not really a spike. We just started graphing this stat the last time this happened. 
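(Aside on awjr's question about alarming on 503s from the mobile site: a minimal sketch of a threshold check is below. It assumes the caller already has request and 503 counts for some sample window; how those counts are collected is left open. The thresholds are made-up placeholders, and the return codes follow the usual Nagios plugin convention of 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.)

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_503_ratio(total_requests, responses_503, warn=0.01, crit=0.05):
    """Return (exit_code, message) for the share of requests answered with 503.

    warn and crit are hypothetical defaults, not tuned values.
    """
    if total_requests <= 0:
        return UNKNOWN, "UNKNOWN - no requests in sample window"
    ratio = responses_503 / total_requests
    if ratio >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif ratio >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    return state, "{} - {:.1%} of requests returned 503".format(label, ratio)


code, message = check_503_ratio(total_requests=1000, responses_503=80)
print(code, message)
```

(A check on the ratio rather than the raw count seems the safer choice here, since mobile traffic volume varies a lot over the day.)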
[19:04:49] which was early July [19:05:03] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [19:05:20] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [19:05:20] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.549 second response time [19:05:47] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.250 second response time [19:05:58] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:59] still need to finish this: https://rt.wikimedia.org/Ticket/Display.html?id=2108 [19:06:05] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.311 second response time [19:07:26] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.095 second response time [19:07:56] Reedy: does 4pm-6pm UK time work for a new deployment window? 
presuming you can either be slightly absent from the Wikidata meeting if things are dragging out [19:08:05] * Damianz imagies chrismcmahon with a greek face yelling 'spikey' like in the mask [19:08:08] s/greek/green/ [19:08:21] robla: should be fine [19:09:08] thanks [19:09:32] ok...going afk-ish for a bit [19:09:59] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:29] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.565 second response time [19:11:51] preilly: done [19:11:55] paravoid: thanks [19:13:08] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:00] !log reedy synchronized wmf-config/CommonSettings.php 'Increase $wgMaxImageArea to 1.4e7' [19:14:11] Logged the message, Master [19:16:08] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [19:16:19] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:59] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:52] !log reedy synchronized wmf-config/CommonSettings.php 'Increase $wgMaxImageArea to 2.5e7 on testwiki' [19:20:06] Logged the message, Master [19:20:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:56] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.830 second response time [19:20:56] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.082 second response time [19:21:21] we may have a way anyway [19:21:33] i was testing earlier with changing how varnish does its backend health testing [19:21:37] and got no effect [19:21:44] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:52] but that turns out to be because the changes didn't take effect with VCL reloads... 
apparently need a full varnish restart [19:22:01] on the one box I tried it seems to behave a bit better [19:22:44] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:44] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:44] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:06] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:24:06] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.792 second response time [19:24:14] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 42269 bytes in 0.141 seconds [19:24:14] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:17] New patchset: Mark Bergsma; "Remove the health probe on single backend appservers, it's doing more harm than good" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28398 [19:26:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28398 [19:27:14] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.920 second response time [19:27:23] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:27:50] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.903 second response time [19:28:18] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28398 [19:28:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28213 [19:29:02] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:32] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.207 second response time [19:30:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:08] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:08] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:11] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:00] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:23] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:32] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.127 second response time [19:33:32] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.878 second response time [19:33:42] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:42] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:42] PROBLEM - Apache HTTP on 
mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:42] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: private and special wikis back to 1.21wmf1 [19:34:11] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:11] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.497 second response time [19:34:11] Logged the message, Master [19:34:44] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:44] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.869 second response time [19:35:03] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.944 second response time [19:35:11] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [19:35:11] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.119 second response time [19:35:11] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.608 second response time [19:35:38] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.160 second response time [19:35:47] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.537 second response time [19:36:26] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:37] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [19:37:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [19:39:15] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.089 second response time [19:39:32] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 
Moved Permanently - 9.027 second response time [19:39:33] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.011 second response time [19:39:50] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.439 second response time [19:42:36] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:14] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:45:59] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:30] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.353 second response time [19:47:58] New patchset: Asher; "route mobile api.php reqs to the api apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28431 [19:49:08] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28431 [19:53:11] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:17] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:32] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.141 second response time [19:55:56] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47065 bytes in 8.887 seconds [19:55:57] strange [19:56:02] we're not getting pages at all [19:56:13] but I just tried sending a test message and it worked [19:56:29] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:57:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:58:02] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.955 second response time 
[20:02:11] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:33] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.052 second response time [20:05:20] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:16] New review: Hashar; "I would prefer we use upload7 as well :-)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28391 [20:06:50] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46872 bytes in 0.137 seconds [20:06:50] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:35] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:55] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:23] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.929 second response time [20:09:06] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.036 second response time [20:09:23] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [20:09:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:26] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:13] New patchset: Asher; "that was supposed to go in vcl_recv" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28438 [20:11:56] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.430 second response time [20:12:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28438 [20:12:27] New patchset: Platonides; "(Bug 41121) Use upload7 as folder in labs, given that production started using /mnt/upload7" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/28391 [20:12:51] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:13:20] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:13:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28391 [20:13:53] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:20] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.271 second response time [20:14:36] New review: Hashar; "upload7 everywhere please :-]" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28391 [20:14:47] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:14] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.332 second response time [20:15:59] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:17] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [20:16:24] New patchset: Platonides; "(Bug 41121) Use upload7 as folder in labs, given that production started using /mnt/upload7" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28391 [20:16:38] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28391 [20:17:29] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.102 second response time [20:17:49] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.644 second response time [20:18:05] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.775 second response time [20:19:14] New review: Hashar; "/mnt/upload6 should still point to /data/project/upload6 I guess :-]" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28391 [20:20:03] grr puppet - err: Could not retrieve catalog from remote server: Connection reset by peer - SSL_connect [20:20:11] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:47] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:33] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.309 second response time [20:22:08] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.391 second response time [20:23:11] New patchset: Platonides; "Use upload7 everywhere." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28442 [20:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [20:23:25] <^demon> !log restarted gerrit service, was missing documentation from UI [20:23:33] Logged the message, Master [20:24:36] New patchset: Hashar; "(Bug 41121) Use upload7 as folder in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28391 [20:24:41] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:18] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28391 [20:25:45] New review: Hashar; "PS4 makes this change simply create the upload7 directory." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28391 [20:26:02] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.238 second response time [20:26:13] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:20] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:41] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.753 second response time [20:27:23] hey dear ops, the beta project would need some love to get a symlink from /mnt/upload7 to /data/project/upload7 . The change is at https://gerrit.wikimedia.org/r/#/c/28391/ ;) [20:27:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28391 [20:27:41] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39615 bytes in 1.014 seconds [20:27:41] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62359 bytes in 0.133 seconds [20:27:41] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:29] New patchset: Hashar; "beta: honor default env" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28377 [20:28:35] hashar: Ryan already merged it, but I wouldn't [20:28:48] why are you depending on production paths anyway? [20:28:48] preilly: ok, the mobile varnish change is finally all pushed [20:28:49] paravoid: why not? [20:28:52] no more 503's [20:29:02] thanks Ryan_Lane :) [20:29:05] change it in the MW config and use whatever's suitable for labs [20:29:07] binasher: wh00t! 
[20:29:11] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47065 bytes in 0.664 seconds [20:29:16] http://bit.ly/Ra5i70 [20:29:31] Change merged: Ryan Lane; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28442 [20:29:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28377 [20:29:45] why would you be depended on production's mountpoints? [20:30:13] binasher: it's missing a pointer with the label "Asher" on top of it [20:30:13] an arrow [20:30:13] whatever you call that [20:30:50] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:56] paravoid: we want beta to match production as closely as possible. So people will get a similar error message (i.e. referencing upload7) ;-D [20:31:34] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki testwiki and test2wiki back to 1.21wmf2 [20:31:44] so you draw the line in the symlink instead of changing a variable in the MW configuration? 
:) [20:31:46] Logged the message, Master [20:32:07] but whatever, it isn't worth arguing about it [20:32:11] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.121 second response time [20:40:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.21wmf2 [20:40:34] Logged the message, Master [20:42:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Fix tetwiki [20:42:30] Logged the message, Master [20:46:26] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:47:02] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:47:10] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test [20:47:22] Logged the message, Master [20:48:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: revert test [20:48:31] Logged the message, Master [20:52:14] binasher: there is a bug fix for MF we'd like to push out today if possible but i believe it will require a varnish cache flush. is it safe to do this afternoon? [20:54:41] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[20:54:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:46] !log nuking wmfsocial mailing list [20:58:58] Logged the message, Master [21:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.597 seconds [21:10:08] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [21:11:20] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:18:04] New patchset: CSteipp; "Initial WikiVoyage config for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28375 [21:19:06] binasher ^^ [21:20:41] awjr: it would be better if it could wait until tomorrow [21:21:04] binasher: no problem [21:21:04] thanks [21:29:35] binasher: curl -i -a http://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Flag_of_El_Salvador.svg/25px-Flag_of_El_Salvador.svg.png [21:29:51] I'm around btw :) [21:29:54] kind of funny, but seriously, I''m not sure why that is happening [21:31:01] AaronSchulz: how was that reported? is it widespread? [21:31:19] https://bugzilla.wikimedia.org/show_bug.cgi?id=41113 [21:31:34] look at the original. [21:31:44] oh wait [21:32:09] oh, is this the same actual issue as the oldimage db lock timeouts? [21:32:23] so, this is a swift 404 body [21:32:30] but without the actual swift error message that we added recently [21:32:37] lots of thumbs do work, just not 25px [21:32:40] and by recently, I mean a month ago [21:32:40] (for those files) [21:32:51] and the swift 404 body was served to squid as a 200 [21:32:53] or three weeks ago [21:33:39] indeed [21:33:50] if you request it from swift directly, the image works [21:34:03] yay for caching [21:34:22] hm [21:34:23] Last-Modified: Wed, 17 Oct 2012 21:33:19 GMT [21:34:37] maybe it's "out of control" too ;) [21:34:38] so, it didn't exist and was just generated then? 
[21:35:02] binasher: https://bugzilla.wikimedia.org/show_bug.cgi?id=41130 :D [21:35:15] one of the best summaries [21:35:57] You never know, the image scalers might just jump out and strangle you any day. [21:36:20] and now it works [21:36:32] so, I have on my backlog the 404 body/200 header [21:38:21] I don't see how swift could generate a 200 with a 404 body [21:38:35] the "The resource could not be found." message is in webob's code in the HTTPNotFound exception [21:38:41] yeah, I was scratching my head on that [21:40:14] New patchset: CSteipp; "Initial WikiVoyage config for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28375 [21:40:35] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay NULL seconds [21:40:35] unless... [21:40:43] except urllib2.HTTPError,status: [21:40:46] [...] [21:40:49] else: [21:40:49] resp = webob.exc.HTTPNotFound('Unexpected error %s' % status) [21:40:52] resp.body = "".join(status.readlines()) [21:40:57] resp.status = status.code [21:41:39] so, if HTTPError is raised, status.code == 200 and status.readlines() = "" that /may/ explains it [21:41:53] but status.code = 200 and raised, difficult to believe [21:43:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:14] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 2265529 seconds [21:45:17] Hi. [21:45:22] Who's in charge of Swift/media handling? [21:45:31] shoot [21:45:49] https://bugzilla.wikimedia.org/show_bug.cgi?id=41130#c0 [21:46:05] we were just talking about this [21:46:07] I guess "generation" isn't really broken. But a lot of old thumbs are stuck. [21:46:17] Oh, sorry. I'll read scrollback. [21:47:33] https://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Broom_icon.svg/22px-Broom_icon.svg.png seems like the most straight-forward test case. 
[21:47:56] AaronSchulz: the other one had Content-Disposition: inline;filename*=UTF-8''Flag_of_El_Salvador.svg.png [21:48:48] AaronSchulz: could MW have passed a 404 for the original back through? [21:49:02] So it seems like there are competing issues here: (1) ?action=purge on the file description page no longer regenerates thumbnails; and (2) reuploading isn't properly purging thumbnails. [21:49:02] New patchset: MaxSem; "Solr replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26571 [21:49:31] Fixing (1) would alleviate the pain of (2). [21:49:58] Brooke: you forgot "returning a 404 html in a 200 reply" [21:50:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26571 [21:50:47] paravoid: I'm not sure that matters as much. Eliminate the 404s and who cares what the response code is? ;-) [21:50:48] paravoid: let me think about that [21:51:24] Brooke: well, the point is that something's broken there [21:51:27] so MW does HEAD -> send headers -> stream file body...I guess one could imagine the HEAD getting 200, then the file getting deleted, then the stream might return the error?
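The HEAD -> send headers -> stream sequence guessed at in the 21:51:27 message can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeStore`, `serve_head_then_stream` and `serve_single_get` are hypothetical names, and the in-memory store stands in for Swift; none of this is MediaWiki's or Swift's actual code.

```python
# Hypothetical in-memory stand-in for an object store; this is not Swift's API.
class FakeStore:
    def __init__(self):
        self.objects = {}

    def head(self, key):
        # Returns (status, content_length) without a body.
        if key in self.objects:
            return 200, len(self.objects[key])
        return 404, 0

    def get(self, key):
        # Returns (status, body); a missing object yields an error body.
        if key in self.objects:
            return 200, self.objects[key]
        return 404, b"The resource could not be found."


def serve_head_then_stream(store, key, between=None):
    """Commit the status from HEAD, then stream whatever GET returns."""
    status, length = store.head(key)
    if status != 200:
        return status, {}, b""
    headers = {"Content-Type": "image/png", "Content-Length": str(length)}
    if between is not None:
        between()  # simulate the object vanishing after the HEAD
    _get_status, body = store.get(key)  # status ignored: headers already sent
    return status, headers, body


def serve_single_get(store, key):
    """Take status and body from the same GET so they cannot disagree."""
    status, body = store.get(key)
    if status != 200:
        return status, {"Content-Type": "text/plain"}, body
    headers = {"Content-Type": "image/png", "Content-Length": str(len(body))}
    return status, headers, body


store = FakeStore()
key = "thumb/22px-Broom_icon.svg.png"  # illustrative key only
store.objects[key] = b"\x89PNG fake image bytes"

# The object is deleted between the HEAD and the GET.
racy = serve_head_then_stream(store, key, between=lambda: store.objects.pop(key))
safe = serve_single_get(store, key)
print(racy[0], racy[1]["Content-Length"], len(racy[2]))
print(safe[0])
```

In the racy path the client sees a 200 with image/png headers, an error body, and a Content-Length that no longer matches the body; taking the status from the same request that supplies the body avoids the mismatch in this toy model.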
[21:54:37] https://bugzilla.wikimedia.org/show_bug.cgi?id=41130#c2 [21:56:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.805 seconds [21:56:30] so [21:56:39] I'm looking at http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Broom_icon.svg/22px-Broom_icon.svg.png [21:56:46] the reply has Content-Type: image/png [21:56:52] and Content-Disposition: inline;filename*=UTF-8''Broom_icon.svg.png [21:56:55] there's no 22px thumb in swift [21:57:10] I checked with list, not a GET, since that would generate it [21:57:51] so I think MW returned that 404 body to swift and swift happily passed it through to squids [21:58:17] I believe it's mw since it's a localized message [21:58:17] yep, thumb.php does indeed add disposition [21:58:26] apergos: what do you mean localized? [21:58:30] apergos: ? [21:58:33] that is just a swift error [21:58:36] I get it in greek [21:58:44] er what? [21:58:48] apergos: Pastebin? [21:58:51] Η εικόνα blah δεν μπορεί να [The image blah cannot be] [21:58:54] I can't pastebin it [21:58:58] it's an image [21:59:04] nah this is just firefox [21:59:07] ..προβληθεί επειδή περιέχει ... [..displayed because it contains ...] [21:59:14] getting an image/png content-type [21:59:18] with an invalid PNG body [21:59:20] ah maybe [21:59:22] this is a firefox error message, ignore that [21:59:28] but it is an image, not text [21:59:55] trust me, it's firefox :) [22:03:44] AaronSchulz: so, thumb.php being silly. plausible? [22:04:22] I was looking at some job queue and type casting stuff [22:04:26] * AaronSchulz is distracted [22:04:32] do I assume that we can ignore the errors in the exception log?
[22:04:47] No width specified to ImageHandler::makeParamString [22:05:05] the $img->getRepo()->streamFile( $thumbPath, $headers ); line is suspect [22:05:15] not the other stream call, since that is from a safe local temp file [22:05:19] (no race issues) [22:07:48] ok [22:07:56] paravoid: yeah I think that is very possible [22:08:07] * AaronSchulz is looking through the call stack [22:12:53] paravoid: I wonder why it is getting noticed [more?] now? [22:14:32] we've had swift troubles lately [22:14:39] with the winsor mckay incident [22:15:02] or even tim's cron that restarts swift frontends every hour or so [22:15:49] that means you have a better chance of having a HEAD work (swift works) and in the meantime swift being unable to respond [22:15:55] and the GET failing [22:18:58] New patchset: CSteipp; "Initial WikiVoyage config for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28375 [22:20:18] paravoid: but wouldn't those be 503s? [22:20:37] swift's would be [22:20:45] I'd assume they'd have different bodies [22:20:45] or it might be a connection interrupted [22:20:53] New review: Andrew Bogott; "OK... I think what I'd like is a refactor of role::mediawiki-update::labs as a parameterized class i..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28355 [22:20:59] a hm, i see your point [22:21:16] or even truncated files [22:21:28] MW sometimes logs incomplete response error [22:22:10] paravoid: MW sends out content-length based on the HEAD [22:22:24] so why does swift cache the response when the body and content-length don't match? 
[22:22:27] I mean squid [22:24:20] I wonder if varnish and squid differ in that case [22:30:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:25] !log reedy synchronized php-1.21wmf2/includes/specials/ [22:31:37] Logged the message, Master [22:45:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.039 seconds [22:48:18] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [23:15:05] AaronSchulz: paravoid: i've confirmed that for "/d/d3/Flag_of_Kiribati.svg/25px-Flag_of_Kiribati.svg.png" -- the file is not in squid, but the memcached key commonswiki:backend:local-swift:file:7e7f5817442893807c2bb86342d8a028cad38caa still exists with metadata when a previous copy of it was generated in august [23:15:18] * file not in swift [23:15:19] grr [23:16:13] so in this case, mediawiki is crafting the 200 response based on stale data from memcached, not based on a head to swift [23:17:04] oh fun [23:19:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.136 seconds
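The stale-metadata finding from 23:15-23:16 (a cached file-stat entry outliving the object it describes) can be illustrated with a small sketch. Everything here is an assumption for illustration: two plain dicts play the roles of memcached and Swift, and `put_object`, `delete_without_purge`, `delete_with_purge`, `stat` and `respond` are hypothetical names, not MediaWiki's FileBackend API.

```python
# Hypothetical stand-ins: plain dicts play the roles of memcached and Swift.
metadata_cache = {}  # path -> {"size": int}
object_store = {}    # path -> bytes

ERROR_BODY = b"The resource could not be found."


def put_object(path, data):
    object_store[path] = data
    metadata_cache[path] = {"size": len(data)}


def delete_without_purge(path):
    # The failure mode: the object goes away but its cached metadata stays.
    object_store.pop(path, None)


def delete_with_purge(path):
    # Removing the cache entry together with the object keeps them consistent.
    object_store.pop(path, None)
    metadata_cache.pop(path, None)


def stat(path):
    # Trust the cache first, as a metadata cache in front of a store would.
    if path in metadata_cache:
        return metadata_cache[path]
    if path in object_store:
        return {"size": len(object_store[path])}
    return None


def respond(path):
    """Decide the status from (possibly cached) metadata, then read the body."""
    meta = stat(path)
    if meta is None:
        return 404, b"not found"
    body = object_store.get(path, ERROR_BODY)
    return 200, body


path = "thumb/25px-example.svg.png"  # illustrative path only

put_object(path, b"fake png bytes")
delete_without_purge(path)
stale = respond(path)    # status built from stale metadata, error body from the store

put_object(path, b"fake png bytes")
delete_with_purge(path)
purged = respond(path)   # cache and store agree that the object is gone

print(stale)
print(purged)
```

In this toy model the stale cache entry alone is enough to produce a 200 wrapped around an error body, with no HEAD to the store involved; purging the entry on delete (or validating it against the store) restores the 404.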