[00:03:26] (CR) Dzahn: [C: -1] subscribe the chained.pem file to the non-chained.pem file [operations/puppet] - https://gerrit.wikimedia.org/r/131087 (owner: RobH)
[00:04:42] (CR) Dzahn: "was this trying to fix a specific bug?" [operations/apache-config] - https://gerrit.wikimedia.org/r/111925 (owner: BryanDavis)
[00:06:27] (CR) Dzahn: [C: 1] contint: switch localvhost to apache::conf [operations/puppet] - https://gerrit.wikimedia.org/r/155707 (https://bugzilla.wikimedia.org/68256) (owner: Hashar)
[00:06:49] (CR) Dzahn: [C: 1] contint: migrate localvhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/155708 (owner: Hashar)
[00:07:11] (CR) Dzahn: [C: 2] "labs only" [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/155680 (owner: Dzahn)
[00:07:45] (CR) Dzahn: [C: 2] "labs only" [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/155682 (owner: Dzahn)
[00:08:06] (CR) Dzahn: [C: 2] "labs only, let's one build the package again" [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/155685 (owner: Dzahn)
[00:18:51] (CR) BryanDavis: "It was a followup to I83cef4b4d1c956ede13b9e124a046015962d7458 and Id2cc29f911fa36805320cdb606a5da1226c9d230 to make handling of http -> h" [operations/apache-config] - https://gerrit.wikimedia.org/r/111925 (owner: BryanDavis)
[00:29:04] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:29:43] !log disabled puppet on osmium again to debug a leak; please don't re-enable
[00:29:49] Logged the message, Master
[00:29:54] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[00:31:51] ori: If you do `puppet --disable "disabled to debug a leak; please don't re-enable"` that message will show when someone tries to run puppet
[00:32:07] bd808: oh, neat. i didn't know that. thanks!
[00:32:25] It's apparently a little known puppet trick :)
[00:32:59] I knew about it but I saw someone write a TIL about it here a couple of days ago
[00:33:04] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[00:35:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[00:37:04] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[00:44:37] (PS1) Ori.livneh: jobrunner: use trebuchet package provider [operations/puppet] - https://gerrit.wikimedia.org/r/155859
[00:59:04] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:00:55] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[01:05:53] (Abandoned) Jeremyb: account creation limit for CIS (tewiki) event [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153383 (https://bugzilla.wikimedia.org/69385) (owner: Jeremyb)
[01:29:04] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:31:56] !log springle Synchronized wmf-config/db-eqiad.php: repool db1056 (duration: 00m 06s)
[01:32:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[01:32:05] Logged the message, Master
[01:33:38] (CR) Hoo man: "Any chance we can move forward here?" [operations/puppet] - https://gerrit.wikimedia.org/r/152724 (owner: Hoo man)
[01:54:20] (CR) Hoo man: [C: 1] "Would be nice to get this merged so that I can rebase the other change into a mergeable state." [operations/puppet] - https://gerrit.wikimedia.org/r/153034 (owner: Hoo man)
[02:09:05] <^d> This is getting rather old.
[02:09:19] orly?
[02:09:46] <^d> ya rly
[02:10:17] What's getting old? :P
[02:10:46] <^d> elasticsearch being a disk hog
[02:10:49] <^d> http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=disk_free&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[02:11:05] i thought it was a RAM hog?
[02:11:42] oO
[02:11:46] <^d> It uses all the ram you can give it, but we're pretty stable with the heap / disk cache we've got now.
[02:12:10] <^d> 1.3.x elasticsearch introduced a regression.
[02:12:12] <^d> https://github.com/elasticsearch/elasticsearch/issues/7386
[02:12:24] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:13:24] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:16:31] <^d> Hopefully the hack Nik put together with upstream will work tomorrow.
[02:16:40] <^d> Otherwise we have to keep babysitting this until 1.3.3 comes out.
[02:18:32] !log LocalisationUpdate completed (1.24wmf17) at 2014-08-23 02:17:28+00:00
[02:18:42] Logged the message, Master
[02:20:24] PROBLEM - LighttpdHTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:21:15] RECOVERY - LighttpdHTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5122 bytes in 6.110 second response time
[02:23:03] !log LocalisationUpdate completed (1.24wmf18) at 2014-08-23 02:21:59+00:00
[02:23:09] Logged the message, Master
[02:36:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[03:07:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Aug 23 03:06:13 UTC 2014 (duration 6m 12s)
[03:07:26] Logged the message, Master
[03:29:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:30:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:34:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:42:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:43:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:46:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
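The disable-with-a-reason trick that bd808 describes at [00:31:51] above is worth spelling out. What follows is a minimal sketch, not the exact session from osmium; it assumes a Puppet 3.x agent run as root (the setup in use at the time), it uses the subcommand form `puppet agent --disable` rather than the bare `puppet --disable` quoted in the chat, and the precise wording of the skip notice varies between Puppet versions.

```sh
# Disable the agent and record why. The reason string is stored in the
# agent's "disabled" lockfile under Puppet's state directory.
puppet agent --disable "disabled to debug a leak; please don't re-enable"

# Anyone who later triggers a manual run gets the reason back instead of a
# silent no-op; the notice reads roughly:
#   Skipping run of Puppet configuration client; administratively disabled
#   (Reason: 'disabled to debug a leak; please don't re-enable');
#   Use 'puppet agent --enable' to re-enable.
puppet agent --test

# Re-enable once the debugging is finished.
puppet agent --enable
```

Recording the reason this way beats disabling the agent silently, since scheduled runs that get skipped report the same message.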
[03:49:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[04:22:54] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:25:34] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Epic puppet fail
[04:27:14] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:28:34] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[04:28:55] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[04:31:34] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:33:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:34:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[04:37:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[04:40:54] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[04:44:04] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[04:45:34] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[04:47:34] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[04:53:11] (CR) Hoo man: [C: -1] "I guess this can't go out before the extension is live on all wikis also, right?" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[04:58:16] (CR) Hoo man: "Ok, this should only be changed for beta right now as (according to what 01tonythomas told me) this template is also being used for produc" [operations/puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[05:03:44] (CR) Hoo man: Added the bouncehandler router to catch in all bounce emails (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[05:11:53] (PS3) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753
[05:13:39] (CR) 01tonythomas: "I changed the API receiver to 'Host: http://d" [operations/puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[05:18:16] (CR) BryanDavis: "A few notes inline. We will need to get the additional extensions into the wmf release branches as well but that's sort of a separate prob" (11 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott)
[05:19:36] (CR) BryanDavis: "Commit message should link to bug 68751 and bug 62496" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott)
[05:20:19] (CR) Legoktm: Random stab at getting wikitech config in here. (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott)
[05:25:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:26:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:52:34] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[05:52:55] RECOVERY - HTTP error ratio anomaly detection on labmon1001 is OK: OK: No anomaly detected
[06:05:34] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: Epic puppet fail
[06:05:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[06:05:54] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[06:06:14] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Epic puppet fail
[06:06:35] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:07:04] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:18:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[06:18:55] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:23:04] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:23:34] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:24:14] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:24:35] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:25:25] PROBLEM - Disk space on elastic1015 is CRITICAL: DISK CRITICAL - free space: / 748 MB (2% inode=96%):
[06:27:25] PROBLEM - Disk space on elastic1013 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%):
[06:28:55] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:32:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:38:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[06:46:04] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:50:15] RECOVERY - Disk space on ms1004 is OK: DISK OK
[06:53:14] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 588 MB (3% inode=94%): /var/lib/ureadahead/debugfs 588 MB (3% inode=94%):
[06:53:34] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Epic puppet fail
[07:13:35] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:26:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:27:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:33:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:34:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:42:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:43:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:09:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[08:17:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[08:21:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:36:44] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:38:03] ... how does one reset a lost 2fa token on labs/wikitech?
[08:38:51] ChrisJ: ask andrewbogott_afk or Coren :P
[08:39:31] thanks legoktm... i managed to get logged in with 2FA emergency tokens, but they don't seem to work for "disable" or "reset".
[08:39:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[08:44:30] looking at ways i could potentially get more involved as a volunteer and figured having my wikitech account functional might be helpful... :)
[08:53:44] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:12:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[09:13:25] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[09:15:24] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[09:15:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[09:30:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[09:31:24] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[09:31:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[09:32:24] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[09:33:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:37:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[10:09:14] PROBLEM - puppet last run on elastic1015 is CRITICAL: CRITICAL: Puppet last ran 14404 seconds ago, expected 14400
[10:10:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[10:17:14] PROBLEM - puppet last run on elastic1013 is CRITICAL: CRITICAL: Puppet last ran 14424 seconds ago, expected 14400
[10:18:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[10:35:31] (PS1) Springle: Depool db1004 for maintenance. Pool db1053 in its place. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155885
[10:36:06] (CR) Springle: [C: 2] Depool db1004 for maintenance. Pool db1053 in its place. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155885 (owner: Springle)
[10:36:10] (Merged) jenkins-bot: Depool db1004 for maintenance. Pool db1053 in its place. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155885 (owner: Springle)
[10:36:55] !log springle Synchronized wmf-config/db-eqiad.php: depool db1004. pool db1053. (duration: 00m 07s)
[10:37:00] Logged the message, Master
[10:37:14] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:40:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[10:42:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[11:21:45] !log Manually removed IPv6 address from mchenry
[11:21:51] Logged the message, Master
[11:23:36] !log Deactivated IPv6 router-advertisement on cr2-pmtpa
[11:23:42] Logged the message, Master
[11:27:47] (PS2) Mark Bergsma: Remove IPv6 address from fenari [operations/puppet] - https://gerrit.wikimedia.org/r/155758
[11:28:07] (CR) Mark Bergsma: [C: 2] Remove IPv6 address from fenari [operations/puppet] - https://gerrit.wikimedia.org/r/155758 (owner: Mark Bergsma)
[11:33:36] !log Manually removed IPv6 addresses from fenari
[11:33:42] Logged the message, Master
[11:34:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:35:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[11:39:45] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[11:39:45] PROBLEM - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[11:39:45] PROBLEM - Host ps1-c1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100%
[11:39:45] PROBLEM - Host mchenry is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:24] PROBLEM - Host linne is DOWN: CRITICAL - Time to live exceeded (208.80.152.167)
[11:40:24] PROBLEM - Host sanger is DOWN: CRITICAL - Time to live exceeded (208.80.152.187)
[11:40:24] PROBLEM - Host fenari is DOWN: CRITICAL - Time to live exceeded (208.80.152.165)
[11:40:34] RECOVERY - Host mchenry is UP: PING OK - Packet loss = 0%, RTA = 33.95 ms
[11:40:39] odd
[11:40:44] RECOVERY - Host linne is UP: PING OK - Packet loss = 0%, RTA = 31.03 ms
[11:40:44] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 31.02 ms
[11:40:44] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 31.50 ms
[11:40:44] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms
[11:40:44] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[11:40:44] RECOVERY - Host ps1-c1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 37.32 ms
[11:57:34] PROBLEM - puppet last run on es7 is CRITICAL: CRITICAL: Epic puppet fail
[12:00:35] RECOVERY - puppet last run on es7 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[12:06:55] (PS4) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753
[12:11:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[12:19:04] (PS5) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753
[12:19:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[12:30:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:30:21] (PS6) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753
[12:33:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[12:41:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[12:42:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:44:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[12:51:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:52:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:54:17] (PS1) Danny B.: cswikinews: Remove unused custom namespace [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155893
[13:55:25] (CR) Danny B.: "This is forgotten old community decision: https://cs.wikinews.org/wiki/Wikizpr%C3%A1vy:V_redakci/07#Tematick.C3.A1_agregace_obsahu" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155893 (owner: Danny B.)
[14:12:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[14:20:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[14:29:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:39:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:41:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:42:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[14:49:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:50:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[15:27:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:32:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[15:36:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:39:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[15:53:15] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Epic puppet fail
[16:03:49] (CR) Deskana: [C: 1] "Since this is only the user-specific JS/CSS that's being enabled, there are no product issues here." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154432 (https://bugzilla.wikimedia.org/57891) (owner: Legoktm)
[16:12:15] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[16:13:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[16:21:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[16:43:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[17:39:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:44:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:48:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:49:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:55:59] (Abandoned) Matanya: ci firewall: move ferm rules to role level and firewall to node level [operations/puppet] - https://gerrit.wikimedia.org/r/144503 (owner: Matanya)
[17:59:27] (CR) Matanya: "I agree here, but since there is no role yet, this is the first step in simplifying the structure." [operations/puppet] - https://gerrit.wikimedia.org/r/117698 (owner: Matanya)
[18:10:08] (PS3) MZMcBride: Enable GlobalCssJs on all CentralAuth wikis minus loginwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154432 (https://bugzilla.wikimedia.org/13953) (owner: Legoktm)
[18:14:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[18:17:47] (PS1) Ori.livneh: Lint Trebuchet provider [operations/puppet] - https://gerrit.wikimedia.org/r/155909
[18:22:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[18:42:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:43:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:44:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[19:24:51] unable to log in: [3a6d1d09] 2014-08-23 19:23:51: Fatal exception of type MWException
[19:24:58] (PS1) Umherirrender: Add new user rights 'editsitejs' and 'editsitecss' to user groups [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155913
[19:32:45] (CR) John F. Lewis: [C: -1] "Duplicated definitions in quite a few places" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155913 (owner: Umherirrender)
[19:34:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:49:23] (CR) John F. Lewis: Add new user rights 'editsitejs' and 'editsitecss' to user groups [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155913 (owner: Umherirrender)
[19:51:13] (CR) Calak: [C: 1] Add new user rights 'editsitejs' and 'editsitecss' to user groups [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155913 (owner: Umherirrender)
[20:01:46] (CR) Steinsplitter: [C: 1] "+1 Looks good to me." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155913 (owner: Umherirrender)
[20:12:35] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:12:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[20:13:04] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[20:13:54] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:15:24] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 2 failures
[20:15:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC
[20:16:04] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Epic puppet fail
[20:16:14] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 4 failures
[20:16:44] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:17:35] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: Puppet has 2 failures
[20:18:35] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:18:46] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:19:35] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:20:14] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 2 failures
[20:20:35] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:20:35] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: Epic puppet fail
[20:22:40] (PS1) coren: Labs: point /public/dumps to the new server [operations/puppet] - https://gerrit.wikimedia.org/r/156002
[20:23:04] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[20:23:14] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 8.11702505882
[20:23:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC
[20:23:35] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[20:25:35] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:26:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:27:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:28:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[20:29:04] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[20:30:35] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:30:52] Coren: are you around ?
[20:31:35] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:44] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:14] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:14] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:33:24] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:33:35] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:33:35] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:33:54] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:34:35] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:35:14] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [20:35:44] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:36:04] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:36:35] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:37:35] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:37:54] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Epic puppet fail [20:37:54] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:36] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:41:14] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 3.08598107143 [20:42:35] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:45:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC [20:48:35] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:48:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:49:04] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:49:14] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:49:35] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:49:54] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:52:55] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:02:45] (03PS2) 10Ori.livneh: Lint Trebuchet provider [operations/puppet] - 10https://gerrit.wikimedia.org/r/155909 [21:03:05] (03CR) 10Ori.livneh: [C: 032 V: 032] "trivial and tested" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155909 (owner: 10Ori.livneh) [21:05:04] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: 
CRITICAL: 7.14% of data above the critical threshold [500.0] [21:05:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [21:22:04] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:22:44] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:26:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:38:15] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:45:04] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:57:04] RECOVERY - HTTP error ratio anomaly detection on labmon1001 is OK: OK: No anomaly detected [21:57:35] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [22:16:34] PROBLEM - Puppet freshness on elastic1015 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:08:52 UTC [22:24:34] PROBLEM - Puppet freshness on elastic1013 is CRITICAL: Last successful Puppet run was Sat 23 Aug 2014 06:16:33 UTC [22:46:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC [23:19:12] (03CR) 10coren: [C: 032] "Better partial than broken." [operations/puppet] - 10https://gerrit.wikimedia.org/r/156002 (owner: 10coren)