[01:07:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[01:19:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[01:55:57] since we have so many anomalies
[01:56:05] perhaps we can perform anomaly detection on anomalies
[01:56:24] so we can know where there is an anomaly in the number of anomalies
[02:01:39] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Fri May 9 23:00:34 2014
[02:12:19] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3790 MB (3% inode=99%):
[02:16:41] !log LocalisationUpdate completed (1.24wmf3) at 2014-05-10 02:15:38+00:00
[02:16:51] Logged the message, Master
[02:20:19] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3434 MB (3% inode=99%):
[02:26:34] !log LocalisationUpdate completed (1.24wmf4) at 2014-05-10 02:25:31+00:00
[02:26:41] Logged the message, Master
[02:30:09] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sat May 10 02:30:03 UTC 2014
[03:00:19] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:03:09] PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 89363 MB (4% inode=99%): /var/lib/hadoop/data/j 99561 MB (5% inode=99%): /var/lib/hadoop/data/e 117548 MB (6% inode=99%): /var/lib/hadoop/data/f 117361 MB (6% inode=99%): /var/lib/hadoop/data/g 108154 MB (5% inode=99%): /var/lib/hadoop/data/c 119912 MB (6% inode=99%): /var/lib/hadoop/data/k 95868 MB (5% inode=99%): /var/lib/hadoop/dat
[03:13:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat May 10 03:12:27 UTC 2014 (duration 12m 26s)
[03:13:39] Logged the message, Master
[03:18:09] PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 90053 MB (4% inode=99%): /var/lib/hadoop/data/j 100169 MB (5% inode=99%): /var/lib/hadoop/data/e 118100 MB (6% inode=99%): /var/lib/hadoop/data/f 119050 MB (6% inode=99%): /var/lib/hadoop/data/g 107620 MB (5% inode=99%): /var/lib/hadoop/data/c 120802 MB (6% inode=99%): /var/lib/hadoop/data/k 96654 MB (5% inode=99%): /var/lib/hadoop/da
[04:02:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[04:07:09] PROBLEM - Host mw1186 is DOWN: PING CRITICAL - Packet loss = 100%
[04:08:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[05:01:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[05:02:15] er
[05:02:30] Never heard of that erorr before
[05:02:44] er problem
[05:04:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[05:11:09] PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 88496 MB (4% inode=99%): /var/lib/hadoop/data/j 99935 MB (5% inode=99%): /var/lib/hadoop/data/e 117519 MB (6% inode=99%): /var/lib/hadoop/data/f 117711 MB (6% inode=99%): /var/lib/hadoop/data/g 108482 MB (5% inode=99%): /var/lib/hadoop/data/c 120099 MB (6% inode=99%): /var/lib/hadoop/data/k 96518 MB (5% inode=99%): /var/lib/hadoop/dat
[05:13:11] :O
[05:21:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[06:11:13] (PS4) Giuseppe Lavagetto: Fix the use of $nagios_group. [operations/puppet] - https://gerrit.wikimedia.org/r/132187
[06:17:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[06:46:11] !log aaron synchronized php-1.24wmf4/img_auth.php 'b08af402ef2de7b2c79f71d848c2b8ae98b47be0'
[06:46:18] Logged the message, Master
[06:56:16] !log aaron synchronized php-1.24wmf3/img_auth.php '264967c58eccb6dae872ab7345d08f8381ac43a7'
[06:56:23] Logged the message, Master
[07:01:26] (PS2) BryanDavis: Rearrange handling of the 'vagrant' user for labs vagrant. [operations/puppet] - https://gerrit.wikimedia.org/r/132446 (https://bugzilla.wikimedia.org/63793) (owner: Andrew Bogott)
[07:01:51] (CR) BryanDavis: [C: 1] Rearrange handling of the 'vagrant' user for labs vagrant. [operations/puppet] - https://gerrit.wikimedia.org/r/132446 (https://bugzilla.wikimedia.org/63793) (owner: Andrew Bogott)
[07:09:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[07:19:39] * aschulz|laptop hrms at http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Image%20scalers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1399706330&g=network_report&z=large
[07:30:16] (PS1) Giuseppe Lavagetto: Better provisioning, get rid of hard-coded dir. [operations/software] - https://gerrit.wikimedia.org/r/132598
[07:30:38] (CR) Giuseppe Lavagetto: [C: 2] Better provisioning, get rid of hard-coded dir. [operations/software] - https://gerrit.wikimedia.org/r/132598 (owner: Giuseppe Lavagetto)
[07:32:14] (PS1) Andrew Bogott: Disable the 'sam reed old' account. [operations/puppet] - https://gerrit.wikimedia.org/r/132599
[07:35:59] (CR) Andrew Bogott: [C: 2] Rearrange handling of the 'vagrant' user for labs vagrant. [operations/puppet] - https://gerrit.wikimedia.org/r/132446 (https://bugzilla.wikimedia.org/63793) (owner: Andrew Bogott)
[07:37:24] (CR) Andrew Bogott: [C: 2] Disable the 'sam reed old' account. [operations/puppet] - https://gerrit.wikimedia.org/r/132599 (owner: Andrew Bogott)
[08:07:09] RECOVERY - Disk space on analytics1013 is OK: DISK OK
[08:08:29] RECOVERY - Disk space on analytics1019 is OK: DISK OK
[08:22:05] (PS1) Hoo man: Remove the wikidata_singlenode module and the labsmediawiki role [operations/puppet] - https://gerrit.wikimedia.org/r/132605
[08:25:00] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[08:27:59] PROBLEM - Host mw1209 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:59] PROBLEM - Host mw1201 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:59] PROBLEM - Host mw1210 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:59] PROBLEM - Host mw1208 is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:59] PROBLEM - Host mw1202 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:00] PROBLEM - Host mw1203 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:28] o_0
[08:30:08] <_joe_> wtf?
[08:30:27] I'm guessing they're all in the same rack
[08:30:35] <_joe_> maybe
[08:30:59] possibly in 2 groups
[08:31:15] try mgmt?
[08:31:36] doesn't get past
[08:31:36] <_joe_> I am kinda on the phone with my parents atm
[08:31:37] 1 ae2-1002.cr2-eqiad.wikimedia.org (208.80.154.131) 0.212 ms 0.220 ms 0.219 ms
[08:36:01] looking
[08:37:15] <_joe_> tnx paravoid
[08:37:30] racktables is wrong btw
[08:37:30] _joe_: you're allowed a life?
[08:37:43] I didn't actually look in it
[08:44:08] (PS1) Tpt: Update to match Wikibase change Ib4014253016db1c3d6b624be9ebbdaf452115145 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132611
[08:53:08] !log rack D5 down, switch unresponsive; minimal impact (mw1201-1203, 1208-1210)
[08:53:16] Logged the message, Master
[09:14:14] (PS1) Yuvipanda: toollabs: Use full URLs for the web error pages [operations/puppet] - https://gerrit.wikimedia.org/r/132619
[09:21:26] (PS1) Faidon Liambotis: Add GeoDNS for secondary/deprecated mobile domains [operations/dns] - https://gerrit.wikimedia.org/r/132622
[09:21:28] (PS1) Faidon Liambotis: Add www.m.wikipedia.org [operations/dns] - https://gerrit.wikimedia.org/r/132623
[09:22:31] (CR) Faidon Liambotis: [C: 2] Add GeoDNS for secondary/deprecated mobile domains [operations/dns] - https://gerrit.wikimedia.org/r/132622 (owner: Faidon Liambotis)
[09:22:41] (CR) Faidon Liambotis: [C: 2] Add www.m.wikipedia.org [operations/dns] - https://gerrit.wikimedia.org/r/132623 (owner: Faidon Liambotis)
[09:27:59] PROBLEM - check configured eth on lvs1001 is CRITICAL: eth3 reporting no carrier.
[09:28:29] PROBLEM - check configured eth on lvs1002 is CRITICAL: eth3 reporting no carrier.
[09:28:49] PROBLEM - check configured eth on lvs1003 is CRITICAL: eth3 reporting no carrier.
[09:29:15] that took a long while
[09:29:52] wait, that's new
[09:33:07] (CR) coren: [C: 2] "Yeay maven? Eeew." [operations/puppet] - https://gerrit.wikimedia.org/r/125200 (owner: Yuvipanda)
[09:33:12] (PS2) coren: toollabs: Add maven to dev_environ [operations/puppet] - https://gerrit.wikimedia.org/r/125200 (owner: Yuvipanda)
[09:34:52] (CR) coren: [C: 2] "Simple package addition" [operations/puppet] - https://gerrit.wikimedia.org/r/125200 (owner: Yuvipanda)
[09:34:55] (PS2) Yuvipanda: toollabs: Use full URLs for the web error pages [operations/puppet] - https://gerrit.wikimedia.org/r/132619
[09:38:00] (CR) coren: [C: 2] "LGTM" [operations/puppet] - https://gerrit.wikimedia.org/r/132619 (owner: Yuvipanda)
[10:04:35] (PS1) Yuvipanda: toollabs: Enable gzip compression [operations/puppet] - https://gerrit.wikimedia.org/r/132632
[10:04:52] (PS2) Yuvipanda: toollabs: Enable gzip compression [operations/puppet] - https://gerrit.wikimedia.org/r/132632
[10:10:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[10:25:32] (PS3) Yuvipanda: toollabs: Enable gzip compression [operations/puppet] - https://gerrit.wikimedia.org/r/132632
[10:48:51] (CR) Aklapper: "Sorry this takes longer. Need to set up a new Labs instance first, or at least test on my other machine when I'm back home :-/" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671 (owner: Odder)
[11:50:04] (CR) coren: [C: 2] "Moar bandwidth!" [operations/puppet] - https://gerrit.wikimedia.org/r/132632 (owner: Yuvipanda)
[11:50:13] (PS4) coren: toollabs: Enable gzip compression [operations/puppet] - https://gerrit.wikimedia.org/r/132632 (owner: Yuvipanda)
[11:56:45] (CR) Catrope: [C: -1] "I'm working on creating a new Phabricator instance in labs because the existing one is kind of messed up. While doing so I took notes to u" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132505 (owner: Dzahn)
[12:03:20] !log hoo synchronized php-1.24wmf3/extensions/Wikidata/ 'Update Wikibase to fix performance issues with dumpJson'
[12:03:27] Logged the message, Master
[12:05:02] !log hoo synchronized php-1.24wmf3/extensions/Wikidata/ 'Update Wikibase to fix performance issues with dumpJson (2nd run)'
[12:05:10] Logged the message, Master
[12:07:09] !log hoo synchronized php-1.24wmf4/extensions/Wikidata/ 'Update Wikibase to fix performance issues with dumpJson'
[12:07:17] Logged the message, Master
[12:09:09] hm, hoo just noticed that some mw* servers are down
[12:09:11] icinga confirms
[12:09:16] anyone know anything?
[12:11:14] ottomata: D5 is down
[12:11:16] see the SAL
[12:11:28] should just be 6 of them, right?
[12:11:31] ja think so
[12:11:34] ok danke
[12:11:38] just checkin
[12:11:41] that someone knwe
[12:11:43] knew
[12:11:51] 04:53 < paravoid> !log rack D5 down, switch unresponsive; minimal impact (mw1201-1203, 1208-1210)
[12:12:34] :)
[12:17:09] (PS1) Yuvipanda: dynamicproxy: Enable gzip for most things by default [operations/puppet] - https://gerrit.wikimedia.org/r/132662
[12:17:21] Coren: andrewbogott ^
[13:00:39] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat May 10 10:00:07 2014
[13:04:09] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 9 below the confidence bounds
[13:11:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[13:22:18] andrewbogott: https://gerrit.wikimedia.org/r/#/c/132662/ minor patch when you have the time :)
[13:23:58] YuviPanda: what does that mean? what gets zipped?
[13:24:54] andrewbogott: responses
[13:24:56] andrewbogott: most responses
[13:25:03] andrewbogott: we merged a similar change for toolllabs a bit ago
[13:25:35] (CR) Andrew Bogott: [C: 2] dynamicproxy: Enable gzip for most things by default [operations/puppet] - https://gerrit.wikimedia.org/r/132662 (owner: Yuvipanda)
[13:25:45] andrewbogott: ty!
[13:43:20] apergos: ping
[13:46:11] !log Reloading Zuul to deploy I403760f1f6dd1bc2
[13:46:18] Logged the message, Master
[13:59:23] !log bsitu synchronized php-1.24wmf4/extensions/Flow 'Update Flow'
[13:59:31] Logged the message, Master
[14:22:16] andrewbogott: 1.7 from the ppa works
[14:22:20] http://tools-proxy-test.wmflabs.org/
[14:22:27] until puppet overwrites my changes to the lua files, that is :)
[14:24:53] YuviPanda: When you say 'the ppa' you mean there's no debian?
[14:24:58] .deb I mean?
[14:25:07] andrewbogott: yeah, I downloaded the debs from the PPA
[14:25:12] andrewbogott: https://launchpad.net/~nginx/+archive/development/+packages
[14:25:14] oh, great.
[14:25:19] So, nothing for me to do then :)
[14:25:23] andrewbogott: https://gerrit.wikimedia.org/r/132695
[14:25:27] andrewbogott: yeah. shall I upgrade dynamicproxy first?
[14:27:00] (CR) coren: [C: 2] "Dead code is dead." [operations/puppet] - https://gerrit.wikimedia.org/r/132605 (owner: Hoo man)
[14:27:32] :)
[14:30:39] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sat May 10 14:30:31 UTC 2014
[14:56:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[15:18:03] (CR) Catrope: [C: 2] Enable anonymous editor acquisition experiment across labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132308 (owner: Robmoen)
[15:20:54] (Merged) jenkins-bot: Enable anonymous editor acquisition experiment across labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132308 (owner: Robmoen)
[15:39:26] (PS2) Reedy: Memory limit to 256M [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132494
[15:40:25] (PS3) Reedy: Memory limit to 235M [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132494
[15:40:46] (CR) Reedy: [C: 2] Memory limit to 235M [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132494 (owner: Reedy)
[15:40:54] (Merged) jenkins-bot: Memory limit to 235M [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132494 (owner: Reedy)
[15:42:42] !log reedy synchronized wmf-config/InitialiseSettings.php 'I415e679197b97e2babe50544cf1e8c26c13a598a'
[15:42:48] Logged the message, Master
[16:12:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[16:13:12] I know! wish I could ack that
[16:20:31] <_joe_> ottomata: and we can't? funny
[16:22:00] i think I don't have perms on icinga?
[16:22:01] not sure
[16:22:12] i tried, but it denied me!
[16:22:50] <_joe_> OOH bad icinga
[16:22:52] <_joe_> lemme try
[16:23:02] you should be able to give yourself them in puppet now iirc
[16:24:07] <_joe_> done
[16:30:55] !log reedy updated /a/common to {{Gerrit|I415e67919}}: Memory limit to 235M
[16:31:02] Logged the message, Master
[16:31:04] (PS1) Reedy: Enable Flow on en_rtlwiki on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132709
[16:31:06] (CR) jenkins-bot: [V: -1] Enable Flow on en_rtlwiki on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132709 (owner: Reedy)
[16:31:19] (CR) Reedy: [C: 2] Enable Flow on en_rtlwiki on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132709 (owner: Reedy)
[16:31:30] (Merged) jenkins-bot: Enable Flow on en_rtlwiki on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132709 (owner: Reedy)
[17:11:25] !log pushed new uploadwizard qunit job to Jenkins
[17:11:32] Logged the message, Master
[17:16:17] (CR) Tim Landscheidt: "Krinkle is right about the non-final nature of the tools list; those were the ones set up in January, others were probably added since and" [operations/apache-config] - https://gerrit.wikimedia.org/r/108465 (https://bugzilla.wikimedia.org/60238) (owner: Tim Landscheidt)
[17:31:39] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sat May 10 14:30:31 2014
[17:35:08] !log hoo synchronized php-1.24wmf3/extensions/Wikidata/ 'Update Wikidata to fix the JSON dump generation'
[17:35:16] Logged the message, Master
[17:36:19] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50422 bytes in 0.031 second response time
[17:36:50] !log hoo synchronized php-1.24wmf4/extensions/Wikidata/ 'Update Wikidata to fix the JSON dump generation'
[17:36:58] Logged the message, Master
[17:38:27] (PS2) Kevinator: account for kleduc and add to admins::privatedata [operations/puppet] - https://gerrit.wikimedia.org/r/131905 (owner: Dzahn)
[17:39:47] (CR) Ottomata: [C: 2 V: 2] account for kleduc and add to admins::privatedata [operations/puppet] - https://gerrit.wikimedia.org/r/131905 (owner: Dzahn)
[17:40:08] OK so
[17:40:14] Gallium is asking for my password for sudo
[17:40:26] It never did that before; I suspect it's related to the mholmquist -> marktraceur username change
[17:41:39] (PS1) Andrew Bogott: Include the labs_initial_content role in labs_vagrant. [operations/puppet] - https://gerrit.wikimedia.org/r/132721
[17:42:28] (PS2) Ottomata: add kleduc to analytics users for hadoop access [operations/puppet] - https://gerrit.wikimedia.org/r/131908 (owner: Dzahn)
[17:43:13] (CR) Ottomata: [C: 2 V: 2] add kleduc to analytics users for hadoop access [operations/puppet] - https://gerrit.wikimedia.org/r/131908 (owner: Dzahn)
[17:52:09] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[17:54:50] OK figured it out.
[18:00:19] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Sat May 10 18:00:13 UTC 2014
[18:31:29] !log approved an oauth request by Aaron Halfaker by making myself oauth admin for a moment
[18:31:36] Logged the message, Master
[18:31:47] hoo :o
[19:31:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[19:58:17] !log hoo synchronized php-1.24wmf3/extensions/Wikidata/ 'Resyncing Wikidata for mw1122'
[19:58:19] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time
[19:58:24] Logged the message, Master
[21:44:29] (PS2) Jforrester: Enable VisualEditor as a Beta Feature on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132409 (https://bugzilla.wikimedia.org/65067)
[21:54:26] (PS1) Odder: Enable Extension:NewUserMessage on ukwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132747 (https://bugzilla.wikimedia.org/65125)
[21:57:30] (CR) John F. Lewis: [C: 1] "Looks good." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/130274 (https://bugzilla.wikimedia.org/64255) (owner: Gerrit Patch Uploader)
[21:59:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[21:59:48] :(
[22:22:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[23:52:30] (CR) Steinsplitter: [C: 1] Enable VisualEditor as a Beta Feature on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132409 (https://bugzilla.wikimedia.org/65067) (owner: Jforrester)