[00:00:29] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [00:13:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [00:21:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 241 seconds [00:24:58] binasher: filed https://bugzilla.wikimedia.org/show_bug.cgi?id=41090 for the SVG regeneration issue [00:25:48] thanks! [00:26:27] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [00:46:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:58:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.951 seconds [01:33:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [01:47:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.302 seconds [01:51:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [02:21:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:35] !log LocalisationUpdate completed (1.21wmf1) at Wed Oct 17 02:26:34 UTC 2012 [02:26:51] Logged the message, Master [02:37:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [02:46:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [02:49:54] !log LocalisationUpdate completed (1.21wmf2) at Wed Oct 17 02:49:53 UTC 2012 [02:50:08] Logged the message, Master [03:11:54] RECOVERY - Puppet freshness on srv220 is OK: puppet ran at Wed Oct 17 03:11:30 UTC 2012 [03:20:35] RECOVERY - Puppet freshness on srv297 is OK: puppet ran at Wed Oct 17 03:20:24 UTC 2012 [03:33:20] RECOVERY - Puppet 
freshness on sq42 is OK: puppet ran at Wed Oct 17 03:33:04 UTC 2012 [04:15:10] New patchset: Tim Starling; "Periodically restart leaky Swift proxies so that I don't have to" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28325 [04:16:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28325 [04:18:04] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28325 [04:19:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:19:32] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:22:02] New patchset: Tim Starling; "Suppress cron output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28326 [04:23:00] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28326 [04:30:15] j^: 16 15:15:12 < notpeter> jeremyb: paravoid yes, I'm waiting on a porting of a patch to librsvg. other than that, the test box was performing just fine. [04:54:30] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [05:20:08] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: HTTP CRITICAL - No data received from host [05:20:42] TimStarling: ^ [05:20:46] ;) [05:45:18] do we even serve stuff from over there yet? I thought not [05:45:50] apergos: idk, just saw there was recent restartings [05:47:15] it's been in a "warning" state since very soon after that "critical" report [05:47:32] I guess critical -> warning transitions don't go to IRC [05:48:23] the only thing happening in the logs over there is getting the monitoring test file [05:48:52] yeah I guess they don't [06:04:59] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:06:30] so, anyone checking out the search page? 
[06:06:35] yes [06:06:51] cool [06:06:58] good morning apergos :) [06:07:06] morning [06:07:26] need me to do anything ? if not, i'll go back to sleep [06:07:49] no. go sleep [06:08:02] thanks for asking though [06:08:08] thanks :) i guess i'll see you again when you go to sleep [06:08:22] have a good (and hopefully no more paging) morning [06:08:22] thank you [06:08:26] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [06:09:22] !log restarted lucene search on search1015 [06:09:32] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:09:34] Logged the message, Master [06:09:47] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [06:19:33] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:41:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.151 seconds [07:07:17] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [07:13:20] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [07:18:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.960 seconds [07:37:17] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [07:59:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.000 second response time on port 11000 [08:05:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:20:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.406 seconds [08:34:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 
10 hours [08:55:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:07:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.614 seconds [09:12:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [09:43:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:44:33] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28210 [09:56:12] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:56:12] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:59:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [10:01:17] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [10:04:33] New patchset: Hashar; "remove executable flags on noc files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28334 [10:06:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [10:30:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:48:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [11:19:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:25:59] New patchset: Hashar; "make labs TMH setup more like production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [11:26:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [11:31:49] New review: Hashar; "deployed on beta hopefully" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [11:33:28] New patchset: Mark Bergsma; 
"First attempt to support passed (high) range requests on Varnish backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28213 [11:33:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.068 seconds [11:34:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28213 [12:01:57] New patchset: Hashar; "extension assets not available on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28337 [12:02:21] New review: Hashar; "Needed by beta labs." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/28337 [12:02:25] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28337 [12:08:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.523 seconds [12:26:10] New patchset: J; "role::applicationserver::common depends on upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28341 [12:27:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28341 [12:34:10] New patchset: Hashar; "beta: blacklist socks proxy listed in sorbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28342 [12:34:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28342 [12:44:29] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:50] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.315 second response time [12:47:11] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:50:37] New review: Demon; "On further thought, this won't quite work on production yet. 
Need to sort out some way to deal with ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/25508 [12:56:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:02] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.949 second response time [13:06:59] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:29] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.243 second response time [13:11:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.436 seconds [13:11:03] otsr-wiki has been loading very slow for a while [13:11:29] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:51] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:12:59] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:41] taking ages indeed [13:14:20] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:37] 19 s here [13:17:20] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.062 second response time [13:17:39] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.423 second response time [13:20:47] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:29] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:29] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:47] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:23:47] RECOVERY - Apache HTTP on mw21 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.747 second response time [13:23:56] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:23] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:29] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.064 second response time [13:25:35] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:35] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:44] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.165 second response time [13:25:56] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:05] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.503 second response time [13:27:05] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.412 second response time [13:27:05] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:05] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.359 second response time [13:27:23] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:30:05] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.644 second response time [13:30:59] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:31:44] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:20] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [13:33:15] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.082 second response time [13:33:32] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [13:33:42] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:53] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:53] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.557 second response time [13:34:53] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46872 bytes in 6.581 seconds [13:36:26] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.545 second response time [13:36:41] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.620 second response time [13:37:20] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:20] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:38] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.377 second response time [13:38:38] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.736 second response time [13:40:07] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:20] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.380 second response time [13:42:06] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:50] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:50] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:59] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:59] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:11] RECOVERY 
- Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [13:44:20] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.009 second response time [13:44:20] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.152 second response time [13:44:31] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.954 second response time [13:45:05] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.455 second response time [13:45:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:41] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:26] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:44] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:20] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:20] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:47] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.095 second response time [13:48:41] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [13:48:41] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [13:48:41] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.912 second response time [13:49:44] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.556 second response time [13:50:20] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:05] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:26] RECOVERY - Apache HTTP on 
mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.280 second response time [13:53:21] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [13:57:05] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:17] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:17] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:26] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.302 second response time [13:58:41] New patchset: Matthias Mullie; "Init Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [13:59:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.613 seconds [14:00:32] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:17] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [14:01:17] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [14:01:17] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:58] mark: ping [14:02:02] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 46879 bytes in 0.453 seconds [14:02:07] something's wrong [14:02:26] New patchset: Demon; "Gerrit hook cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28351 [14:02:28] and paging's probably broken -- I didn't get anything [14:02:42] hmm I don't either [14:02:44] latency is through the roof, so are 500s [14:02:48] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39606 bytes in 3.721 seconds [14:02:48] but maybe it's not 
"critical" [14:03:03] a lot of parsercache misses [14:03:10] apergos: LVS alerts always are afaik [14:03:23] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28351 [14:03:32] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:59] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:44] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.893 second response time [14:05:02] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.101 second response time [14:05:06] not for https [14:06:06] that's not https [14:06:15] not all of them at least [14:06:17] indeed [14:06:19] the problem is with appservers [14:06:24] their load has spiked [14:06:38] graphite shows increased latency and 500s, probably because of the contention [14:06:59] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.142 second response time [14:07:11] a lot of BannerListLoader/BannerLoader hits [14:07:13] New review: Hashar; "Gallium has puppet client 2.7.6 which lack the puppet/util/color class :-( Will hopefully get fixed..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/28351 [14:07:25] 10% of hits that is [14:07:47] where did you see that? 
[14:08:07] sampled-1000 [14:08:10] and grep [14:08:11] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:12] on emery [14:08:21] poor man's log analysis [14:08:22] New patchset: Demon; "Suppress IRC notification on patchset-created for drafts (DO NOT MERGE UNTIL 2.5)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28352 [14:08:59] higher external store get rate according to graphite [14:09:08] by a factor 2-3 [14:09:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28352 [14:09:33] indeed [14:09:45] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:09:47] New review: Demon; "jenkinsbot is ready to go (see comment here from it). Have it disabled until this is merged so we do..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/28351 [14:10:40] bannerloader/listloader are mostly cache hits [14:11:02] yeah that's probably mwalker's purge script [14:11:02] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.433 second response time [14:11:09] misses are a 0.6% of hits [14:11:11] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.530 second response time [14:11:22] er? [14:11:25] i notice that tim has done something on pc1 [14:11:29] parser cache [14:12:04] New review: Hashar; "disregard my comment. Chad has set up the job to use a shell instead of the rake script :-]" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28351 [14:13:15] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28221 [14:13:44] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:47] seems like mysql is not listening on its port on pc1? 
[14:15:08] ah it's doing recovery [14:15:14] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.069 second response time [14:15:30] huh [14:15:44] 121017 13:54:59 InnoDB: Database was not shut down normally! [14:15:44] InnoDB: Starting crash recovery. [14:15:46] it's at 85% [14:15:58] 88%.. [14:16:33] 90 [14:16:33] I think we should just wait [14:17:57] whoops [14:17:59] it has done recovery before, and then crashed [14:18:30] many times [14:19:08] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:26] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:44] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:11] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:20:11] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:20:38] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:38] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:38] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.976 second response time [14:21:14] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [14:21:59] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.255 second response time [14:22:09] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:36] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [14:23:32] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:38] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.089 second response time [14:23:47] RECOVERY - Apache HTTP on mw43 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.548 second response time [14:23:47] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:59] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.192 second response time [14:26:05] New patchset: jan; "Add "role::mediawiki-update::labs" for updating labs-MW-installations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [14:26:56] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.894 second response time [14:27:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28355 [14:29:11] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:23] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:34] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.713 second response time [14:30:41] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:31:53] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.997 second response time [14:32:20] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:23] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:41] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:50] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [14:33:50] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.178 second response time [14:34:53] !log midom synchronized wmf-config/InitialiseSettings.php [14:35:05] Logged the message, Master 
[14:35:20] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.585 second response time [14:36:32] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.671 second response time [14:36:41] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:38:04] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [14:40:17] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:26] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:11] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:38] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [14:41:48] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.098 second response time [14:42:44] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.419 second response time [14:44:56] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.972 seconds [14:46:28] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.592 second response time [14:46:35] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:44] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:29] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:56] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.103 second response time [14:48:14] RECOVERY - Apache HTTP on mw39 is 
OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.142 second response time [14:49:17] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.225 second response time [14:49:35] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:56] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [14:51:32] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:08] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.712 second response time [14:53:02] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.439 second response time [14:55:17] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [14:55:17] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:48] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.745 second response time [14:57:50] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:28] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.457 second response time [15:02:20] New patchset: Mark Bergsma; "First attempt to add Range support to Varnish in streaming mode" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28361 [15:07:17] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:38] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.569 second response time [15:09:34] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:34] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:02] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time 
[15:12:33] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.936 second response time [15:17:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:50] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [15:20:03] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:23] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.235 second response time [15:22:26] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:47] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.519 second response time [15:23:57] !log reedy synchronized php-1.21wmf1/extensions/EducationProgram/ [15:24:06] Logged the message, Master [15:25:53] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:53] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:23] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.374 second response time [15:28:26] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:53] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [15:29:56] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62554 bytes in 0.807 seconds [15:31:46] !log reedy synchronized php-1.21wmf2/extensions/EducationProgram [15:31:58] Logged the message, Master [15:35:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [15:35:56] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [15:36:03] !log switched donate.wikimedia.org to geodns [15:36:13] Logged the message, Master [15:37:27] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.393 second response time [15:39:32] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:02] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.716 second response time [15:45:41] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:01] !log reedy Started syncing Wikimedia installation... : Rebuilding localisation cache for educationprogram [15:47:09] Logged the message, Master [15:48:52] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.205 second response time [15:50:20] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 80% free (6060 MB out of 7628 MB) [15:53:38] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:17] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 72% free (5468 MB out of 7628 MB) [15:58:08] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.258 second response time [15:59:02] !log increasing weight on mw17-mw59 by 33.3% [15:59:13] Logged the message, notpeter [16:00:14] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 64% free (4854 MB out of 7628 MB) [16:00:28] Ganglia seems upset... bits esams isn't showing at all [16:00:53] one of the eqiad esams boxes is marked as down.. 
[16:01:06] and the lvs graphs have seemingly just stopped [16:01:32] !log reedy synchronized php-1.21wmf1/cache/l10n/ [16:01:42] Logged the message, Master [16:05:38] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 56% free (4254 MB out of 7628 MB) [16:08:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:17] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 48% free (3646 MB out of 7628 MB) [16:12:35] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:35] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:03] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:03] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:02] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 46870 bytes in 0.627 seconds [16:14:02] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.851 second response time [16:15:11] !log reedy synchronized php-1.21wmf2/cache/l10n/ [16:15:14] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 40% free (3047 MB out of 7628 MB) [16:15:21] Logged the message, Master [16:15:59] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [16:16:09] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.446 second response time [16:18:42] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:59] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:17] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:19:17] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:26] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:44] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:44] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:58] ok, who killed cp1042 [16:20:11] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:20:20] PROBLEM - check_swap on indium is CRITICAL: SWAP CRITICAL - 33% free (2444 MB out of 7628 MB) [16:20:20] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.013 second response time [16:20:38] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.688 second response time [16:21:03] !log restarting varnish on cp1042 , gone into swapdeath spiral [16:21:16] Logged the message, Mistress of the network gear. [16:21:59] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:55] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:11] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:29] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.418 second response time [16:23:29] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:29] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.358 second response time [16:23:38] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:38] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:14] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23]