[00:49:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[01:30:10] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:37:09] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[02:05:00] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3873 MB (3% inode=99%):
[02:07:21] do we care about virt0?
[02:07:26] (tampa)
[02:18:48] !log LocalisationUpdate completed (1.24wmf15) at 2014-09-07 02:17:44+00:00
[02:19:02] Logged the message, Master
[02:31:19] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-07 02:30:15+00:00
[02:31:25] Logged the message, Master
[02:43:16] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-07 02:42:12+00:00
[02:43:21] Logged the message, Master
[02:49:29] PROBLEM - very high load average likely xfs on ms-be1005 is CRITICAL: CRITICAL - load average: 210.89, 115.08, 55.84
[02:50:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[03:00:59] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:12:49] PROBLEM - swift eqiad-prod container availability on tungsten is CRITICAL: CRITICAL: 10.34% of data under the critical threshold [96.0]
[03:29:47] i'll assume that last alert is related to ms-be1005?
[03:30:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Sep 7 03:29:51 UTC 2014 (duration 29m 50s)
[03:31:03] Logged the message, Master
[03:35:58] jeremyb: heya, did you mean to claim/assign to yourself this task http://fab.wmflabs.org/T644 ? (If you did, awesome! just double checking)
[04:06:58] greg-g: yeah
[04:07:07] greg-g: i fixed morebots once this week already
[04:07:15] jeremyb: sweet, thank you sir. :)
[04:07:20] what's a little more tweaking? :)
[04:23:32] speaking of, looks like wikitechwiki was still broken today
[04:23:42] i wonder if andrewbogott_afk looked at my patch
[04:25:02] greg-g: so, should it be !log foo or !log deployment-prep foo ?
[04:25:24] or do you want a QA log and people should know to use that instead of deployment-prep?
[04:52:30] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[04:59:09] greg-g: reping. btw, did you catch that swift issue above? seems like the kind of thing you like to keep an eye on
[05:12:01] jeremyb: re !log, if it can be simply "!log foo" and that goes to the right SAL, https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL then that's ideal.
[05:12:17] '
[05:12:20] whoops
[05:12:24] I didn't see the swift thing, /me looks
[05:13:30] swift maybe isn't urgent or maybe is, unsure. but it's something that's happened recently and won't fix itself
[05:13:55] (that's why we now have the special-case nagios alert for load avg)
[05:14:37] greg-g: not sure about that. you're talking about having a second beta cluster (in its own project) and there's QA stuff unrelated to beta, right? e.g. jenkins, cloudbees, etc.
[05:14:48] greg-g: i was thinking maybe it should all be a single unified log
[05:41:19] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[05:41:29] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:00:40] PROBLEM - DPKG on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
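The flood of "CHECK_NRPE: Socket timeout after 10 seconds" alerts beginning here means Icinga's NRPE queries to ms-be1005 went unanswered: the host itself was wedged, not the twenty individual services. Each alert boils down to a query like the following, run from the monitoring host (a sketch: the plugin path is the Debian/Ubuntu default and the remote command name is an assumption, though the 10-second timeout matches the alert text):

    # Ask the NRPE agent on ms-be1005 to run its local DPKG check,
    # giving up after 10 seconds. On a hung host every such query
    # times out, so every service check goes CRITICAL at once.
    /usr/lib/nagios/plugins/check_nrpe -H ms-be1005 -c check_dpkg -t 10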
[06:00:41] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:49] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:50] PROBLEM - SSH on ms-be1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:00:50] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:50] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:50] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:50] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:59] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:59] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:09] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:09] PROBLEM - RAID on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:09] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:10] PROBLEM - check if dhclient is running on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:10] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:19] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:30] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:05:54] greg-g: this seems to be the same as https://rt.wikimedia.org/Ticket/Display.html?id=8249
[06:06:05] fwiw
[06:13:29] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:14:04] icinga-wm: thanks! we got the picture
[06:14:15] ah indeed, what we've observed before is the machine staying responsive, but that doesn't seem to be the case here. anyway, I'm taking a look at the console
[06:15:32] !log powercycle ms-be1005, not even responsive on console
[06:15:37] Logged the message, Master
[06:15:56] was there a stack at least?
[06:16:11] no :( just the last line of getty
[06:16:27] perhaps it logged to kern.log
[06:16:45] is kern.log xfs? :D
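Hoping the hang left a trace in kern.log is the standard next step for these XFS lockups. After the powercycle, a quick look might go like this (a sketch: the paths are Ubuntu's default kernel logs, and the patterns are just common markers for XFS errors and hung tasks, not anything specific to this incident):

    # Search the current and rotated kernel logs for anything the box
    # managed to flush before it wedged: XFS errors, hung-task
    # warnings, oopses.
    grep -iE 'xfs|hung task|blocked for more than|call trace' \
        /var/log/kern.log /var/log/kern.log.1 | tail -n 50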
[06:17:29] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[06:17:53] haha, fair point. I think icinga wouldn't spam if services had dependencies, fwiw
[06:18:19] RECOVERY - swift-account-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[06:18:29] RECOVERY - very high load average likely xfs on ms-be1005 is OK: OK - load average: 5.51, 1.32, 0.44
[06:18:29] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[06:18:30] RECOVERY - DPKG on ms-be1005 is OK: All packages OK
[06:18:39] RECOVERY - swift-account-reaper on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[06:18:39] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 2430 seconds ago with 0 failures
[06:18:40] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[06:18:40] RECOVERY - swift-account-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[06:18:40] RECOVERY - swift-account-server on ms-be1005 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[06:18:40] RECOVERY - swift-container-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[06:18:40] RECOVERY - swift-container-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[06:18:49] RECOVERY - swift-container-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[06:18:49] RECOVERY - swift-container-server on ms-be1005 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[06:18:50] godog: yeah, well, i was thinking we could have a single unified check for "all expected processes running"
[06:18:59] RECOVERY - swift-object-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[06:18:59] RECOVERY - RAID on ms-be1005 is OK: OK: optimal, 14 logical, 14 physical
[06:19:00] RECOVERY - swift-object-auditor on ms-be1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[06:19:00] RECOVERY - check if dhclient is running on ms-be1005 is OK: PROCS OK: 0 processes with command name dhclient
[06:19:00] RECOVERY - swift-object-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[06:19:09] RECOVERY - swift-object-server on ms-be1005 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[06:19:16] dependencies could help a little. but maybe not the right timing
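The dependency idea maps to Icinga 1.x servicedependency objects: tie the per-process swift checks to a host-level check so a wedged box produces one page instead of twenty. A minimal sketch for a single pair (standard Icinga object syntax; making the swift checks depend on the SSH check is an assumption about how it might be wired up):

    # If the SSH check on ms-be1005 is already CRITICAL or UNREACHABLE,
    # suppress notifications for the dependent per-process check.
    define servicedependency {
        host_name                     ms-be1005
        service_description           SSH
        dependent_host_name           ms-be1005
        dependent_service_description swift-account-server
        notification_failure_criteria c,u
        execution_failure_criteria    n
    }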
[06:20:30] yep, worth a try
[06:21:49] (PS1) Ori.livneh: update require_package() to latest [puppet] - https://gerrit.wikimedia.org/r/158913
[06:27:40] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Epic puppet fail
[06:28:09] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:49] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:00] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:30] RECOVERY - Disk space on ms1004 is OK: DISK OK
[06:33:30] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 2 MB (0% inode=94%): /var/lib/ureadahead/debugfs 2 MB (0% inode=94%):
[06:45:49] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:47:40] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:52:27] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[06:56:40] PROBLEM - puppet last run on db60 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:59:49] RECOVERY - swift eqiad-prod container availability on tungsten is OK: OK: Less than 1.00% under the threshold [98.0]
[07:07:19] bd808|BUFFER: thx
[07:10:19] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:14:40] RECOVERY - puppet last run on db60 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:18:19] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[07:18:29] RECOVERY - HTTP error ratio anomaly detection on labmon1001 is OK: OK: No anomaly detected
[08:53:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[10:54:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[12:47:49] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Epic puppet fail
[12:55:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[13:06:49] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[14:46:49] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Epic puppet fail
[14:56:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[14:59:34] (CR) Hoo man: [C: -1] "This won't work as is, as you use arrays as strings right now (see inline comments)." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
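The -1 above flags a classic Puppet pitfall: interpolating an array directly into a string, which doesn't render as the separated list the generated config needs. A hypothetical illustration of the problem and the usual fix (the variable names and the exim-style separator are invented for the example; join() comes from puppetlabs-stdlib):

    # Wrong: the array does not interpolate as a usable list.
    $verp_domains = ['wikimedia.org', 'wikipedia.org']
    $broken = "domainlist verp_domains = ${verp_domains}"

    # Usual fix: join the elements explicitly with the separator the
    # generated configuration expects.
    $joined = join($verp_domains, ' : ')
    $fixed  = "domainlist verp_domains = ${joined}"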
[15:06:49] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[15:12:09] !log manually changed /etc/hosts entry on analytics1004 from having "analyticas1004.eqiad.wmnet" to "analytics1004.eqiad.wmnet"
[15:12:15] Logged the message, Master
[15:12:28] just in case that impacts something weirdly (I doubt it)
[15:30:41] (PS30) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[16:02:35] (CR) Hoo man: "Looks ok now (untested, at first glance)." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[16:05:08] (PS2) Faidon Liambotis: Allocate sandbox vlans for codfw and ulsfo [dns] - https://gerrit.wikimedia.org/r/158636 (owner: Mark Bergsma)
[16:05:10] (PS1) Faidon Liambotis: Allocate IPv4/IPv6 for RIPE Atlas codfw/ulsfo [dns] - https://gerrit.wikimedia.org/r/158939
[16:26:30] (CR) Faidon Liambotis: [C: +1] "A couple of small comments inline." (2 comments) [dns] - https://gerrit.wikimedia.org/r/158382 (owner: BBlack)
[16:57:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[17:24:51] (CR) Faidon Liambotis: [C: +2] "I added this statement during the gdnsd switchover. The only reason I did was to keep the delta of responses from the old PowerDNS setup a" [puppet] - https://gerrit.wikimedia.org/r/158637 (owner: BBlack)
[17:49:24] what does mw1011 do? is it a jobrunner?
[18:09:42] (CR) Ori.livneh: [C: +2] update require_package() to latest [puppet] - https://gerrit.wikimedia.org/r/158913 (owner: Ori.livneh)
[18:10:29] jackmcbarn: yes
[18:10:38] ori: and it's zend, right?
[18:10:51] i'm pretty sure, but let me confirm
[18:11:00] i think _joe._ imaged a couple of additional machines
[18:11:12] yes, it's zend
[18:11:28] why, is there some indication that it's misbehaving?
[18:11:44] it took 10 seconds to run lua that takes 2 seconds when i purge/nulledit/preview it
[18:13:01] (and it didn't get done, so it probably would have taken a lot longer were it allowed to finish)
[18:13:07] are you using HHVM?
[18:13:27] also, it has high load relative to the web tier apaches
[18:14:13] yes, i'm on hhvm. when i switch to zend, then it takes 5-6 seconds
[18:14:42] should we maybe give jobrunners a higher timeout than the webservers have, to mitigate the "Script error"s appearing on articles?
[18:16:15] i'm more inclined to think that we should increase capacity such that there isn't a gap in performance
[18:16:21] how widespread of a problem is it?
[18:18:03] this is the first one i've seen since you shut off the hhvm jobrunner
[18:18:38] though there's a good chance that, since page cache and links tables aren't always in sync, there's a lot more that i don't know about
[18:19:36] jackmcbarn: (slight tangent.. ) to what extent do the resource limits constrain what people do with lua?
[18:21:02] i ask because i worry about the following scenario: we roll out HHVM, performance improves sharply for a bit, then gradually declines to the current norm as people use the new platform to do more and more complex things with Lua.
[18:21:48] ori: i've never seen an article anywhere near the limit for "real"
[18:22:23] ok, that's reassuring
[18:22:40] and even really pathological cases (one of which just crashed firefox as i was about to give you the url) only take 0.5 seconds of lua
[18:23:07] on reflection, maybe you're right about just increasing the limit as a stopgap for the jobrunners
[18:23:22] what would be reasonable? 15s?
[18:23:58] i'd say 20 for now, and see if they start to go over that
[18:25:20] i'm thinking of adding a feature to scribunto, where if pages use more than a (configurable) half of their assigned time, they're put in a warning category
[18:25:57] that would be extremely useful
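The 20 seconds settled on here is Scribunto's LuaSandbox CPU ceiling in mediawiki-config; the actual change is the Gerrit patch logged just below. A minimal sketch of what such a jobrunner-only bump could look like (the CLI-based jobrunner detection and the surrounding structure are assumptions made for this example, not how the real config decides):

    <?php
    // Scribunto's LuaSandbox engine reads its CPU ceiling, in seconds,
    // from this setting. Web requests keep the normal limit; job
    // runners, which re-parse pages in the background, get double.
    $isJobRunner = ( PHP_SAPI === 'cli' );  // hypothetical detection
    $wgScribuntoEngineConf['luasandbox']['cpuLimit'] = $isJobRunner ? 20 : 10;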
[18:33:51] (PS1) Ori.livneh: Scribunto: double the Lua CPU limit on the job runners [mediawiki-config] - https://gerrit.wikimedia.org/r/158948
[18:35:50] (CR) Ori.livneh: Scribunto: double the Lua CPU limit on the job runners (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/158948 (owner: Ori.livneh)
[18:37:37] jackmcbarn: thanks again for your work on lua scripting in general.
[18:58:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[19:11:16] (PS1) Jackmcbarn: Increase $wgSVGMaxSize to 4096 [mediawiki-config] - https://gerrit.wikimedia.org/r/158951 (https://bugzilla.wikimedia.org/70529)
[19:38:14] (PS31) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[19:50:11] welcome back(?) paravoid
[19:50:48] nope :)
[19:50:55] I hopped
[20:36:22] !log mw1017: upgraded HHVM from 3.3-dev+20140728+wmf5 to 3.3-dev+20140728+wmf6
[20:36:26] Logged the message, Master
[20:59:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[21:10:44] (PS2) Ori.livneh: Clean up salt::grain [puppet] - https://gerrit.wikimedia.org/r/153783
[21:11:42] (PS10) Ori.livneh: Clean up salt::minion [puppet] - https://gerrit.wikimedia.org/r/153727
[21:15:24] (PS1) Krinkle: apache: Remove old comments referencing 'yaseo' [puppet] - https://gerrit.wikimedia.org/r/158996
[23:00:49] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[23:11:28] (PS2) Alex Monk: apache: Remove old comments referencing 'yaseo' [puppet] - https://gerrit.wikimedia.org/r/158996 (owner: Krinkle)
[23:25:29] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Epic puppet fail
[23:35:05] !log upgrading liblua everywhere
[23:35:09] Logged the message, Master
[23:38:10] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:40:00] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:40:19] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 2 failures
[23:41:09] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:41:39] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 2 failures
[23:41:39] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:42:59] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 3 failures
[23:43:10] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:43:39] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: Puppet has 2 failures
[23:44:09] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:44:29] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[23:57:10] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[23:57:19] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[23:58:00] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[23:58:40] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[23:59:09] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures