[00:00:45] (03PS1) 10Ori.livneh: Revert "Force 'Transfer-Encoding: Chunked' header on 404 responses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222726 [00:00:58] (03CR) 10Ori.livneh: [C: 032] Revert "Force 'Transfer-Encoding: Chunked' header on 404 responses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222726 (owner: 10Ori.livneh) [00:01:02] (03Merged) 10jenkins-bot: Revert "Force 'Transfer-Encoding: Chunked' header on 404 responses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222726 (owner: 10Ori.livneh) [00:15:51] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1426054 (10ori) For the past several weeks, this task was blocked on the fact that the HHVM renderer (mw1152) would cause 5xx spikes whenever it was pool... [00:16:06] (03PS21) 10Paladox: Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [00:25:44] 21 [00:28:09] I told him we're not doing that :< [00:29:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [00:30:11] ori, the codfw imagescalers are all hhvm+trusty already, right? [00:30:49] Krenair: I have no idea. Let's see. [00:30:59] it's just mw1153-1160 left? [00:31:28] yes, codfw are all hhvm+trusty [00:31:34] there's tin and terbium, too [00:31:37] right [00:31:40] they're another task [00:31:48] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 4 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1426057 (10Krenair) [00:32:30] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 4 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1095183 (10Krenair) (codfw imagescalers are all HHVM+Trusty already, so we're just left with mw1153-1160) [00:33:12] it should be pretty straightforward. IIRC the issue was/is that HHVM was slow at checking files for syntax errors, which is one of the things scap needs to do. But we can just continue using php5 for that. [00:34:18] we also need silver, and the snapshot hosts [00:34:34] * ori groans. [00:35:15] isn't there some command you can run that checks what php is installed on all boxes? 
:p [00:35:21] yeah [00:35:34] dsh on mediawiki-installation [00:35:36] or some salt magic [00:37:24] Ah [00:37:35] tmh [00:38:01] seems to be precise+php5 [00:38:11] sounds very likely [00:39:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:41:16] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1426063 (10Krenair) [00:41:35] * Krenair will file a ticket for that too [00:41:42] {'terbium.eqiad.wmnet': '/usr/bin/php5'} [00:41:42] {'silver.wikimedia.org': '/usr/bin/php5'} [00:41:42] {'tin.eqiad.wmnet': '/usr/bin/php5'} [00:41:44] {'mw1159.eqiad.wmnet': '/usr/bin/php5'} [00:41:46] {'mw1158.eqiad.wmnet': '/usr/bin/php5'} [00:41:48] {'mw1157.eqiad.wmnet': '/usr/bin/php5'} [00:42:43] snapshot* hosts are not in mediawiki-installation, but they are indeed using php5 [00:42:57] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426066 (10Krenair) 3NEW [00:43:19] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426075 (10Krenair) [00:43:22] Krenair: thanks, that's very useful [00:43:22] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1426074 (10Krenair) [00:43:40] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426066 (10Krenair) [00:43:43] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1426077 (10Krenair) [00:44:07] 6operations, 6Labs, 10wikitech.wikimedia.org, 7HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1426080 (10Krenair) [00:45:17] wasn't there already one for the tmh boxes [00:45:17] silver is such a mess [00:45:22] Or was it just mentioned on other tickets? [00:46:06] Reedy, I couldn't find one [00:46:26] Krenair, thanks again. That was some badly-needed and long-overdue bug triage. [00:47:22] So either I'm going blind or Phabricator's search is being unhelpful [00:47:27] Once this is all done, are we going to bump the minimum requirement to 5.4, or will we leapfrog 5.4 and go straight to 5.5? [00:48:15] I guess we need to survey most of the common distros [00:48:20] PHP are supporting 5.4 still at least [00:48:54] silver is running PHP 5.5.9 [00:48:58] So not technically a blocker for this [00:49:21] It's a blocker for any hhvm specific syntax in mw-config [00:49:28] But not for keeping 5.3 crap around in mw core [00:49:39] yeah [00:50:04] Krenair: phab search is useless. Try Google site:phabricator.wikimedia.org [00:50:31] needs a greasemonkey script [00:51:29] heh [00:52:33] Reedy, btw, you mentioned dsh and salt earlier: salt is an ops thing, right? [00:52:57] ops-only* [00:53:04] root-only, yes. [00:53:14] and dsh is... not? [00:53:18] nope [00:53:26] if you've got login access to the box... 
[00:53:31] dsh is just a wrapper around ssh [00:53:33] and can run your target command [00:53:43] I see [00:57:59] it's not as easy as it once was, Reedy [00:58:05] because agent-forwarding is disabled [00:58:12] heh, true [00:58:19] so ability to ssh to mw1041 + ability to ssh to tin != ability to ssh to mw1041 from tin [00:58:36] so you have to run it locally and it's slow? [00:58:38] but you can ssh as mwdeploy using the keyholder ssh agent socket, which is what scap does behind the scenes [00:58:58] i just hacked out the invocation, it's not exactly intuitive, so write it down :P [00:59:07] lol [00:59:12] SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -F 20 -g mediawiki-installation -r ssh -o -oUser=mwdeploy -- [command] [00:59:58] RECOVERY - haproxy failover on dbproxy1004 is OK check_failover servers up 2 down 0 [01:00:22] https://wikitech.wikimedia.org/wiki/Dsh is super out of date :( [01:00:26] !log reload haproxy dbproxy1004 [01:00:30] Logged the message, Master [01:00:42] zwinger, lol [01:01:02] wikitech is unusably slow and clunky [01:01:04] ohai springle [01:01:19] that does not bode well for keeping docs up-to-date [01:02:02] seems pretty quick to me? [01:02:55] Reedy: you a pilot yet? [01:03:14] springle: I've got a FAA PPL now.. So I guess, technically, yes? ;D [01:03:35] zwinger first appears in SAL archive 1, i.e. 11 years ago [01:04:18] http://i.imgur.com/pvzFgAk.png -- firstPaint at 3.7s [01:05:29] enwiki's main page, which is substantially bigger, is 2.1s [01:05:50] lol [01:06:10] it's not fast, but not that slow for me... on crappy hotel wifi [01:06:57] then there's the fact that the openstack integration is so atrocious and busted that the mere presence of these features in the sidebar in my visual field makes me want to close the tab as fast as possible [01:07:21] merge with mediawiki.org? ;) [01:07:32] * Krenair kicks legoktm [01:07:47] Krenair: sorry, I meant wikimediafoundation.org [01:07:58] no, just split up the openstack and documentation stuff [01:08:10] it's getting split up eventually anyway since ops are going for openstack horizon afaik [01:08:15] legoktm, haha [01:08:22] yep [01:08:39] also it should have a markdown contenthandler [01:09:02] I love VisualEditor (really) but it's not optimal for documenting shell scripts and configuration files [01:09:38] and (most controversially) I think pages that have not been edited in N months should be auto-deleted [01:10:05] outdated documentation is worse than no documentation at all, because it is actively misleading [01:10:21] and there is just so much of it around [01:10:38] anything that is still relevant gets edited at least once a year [01:10:45] i challenge you to find something useful that hasn't been [01:11:47] here's one to amuse springle: https://wikitech.wikimedia.org/wiki/Adding_a_file_in_innodb [01:12:11] heh [01:13:32] ori: my favorite page on wikitech is https://wikitech.wikimedia.org/wiki/Hurricanes [01:16:08] and we'd lose all these awesome pages if it was up to ori [01:16:39] ori, are dumps relevant? [01:17:00] I donno, I think ori may be right [01:17:15] I went digging for an example of no edits in a year, but still useful unique info [01:17:15] Krenair: to hhvm and such? 
[01:17:23] in general [01:17:33] I don't follow [01:17:38] anything that is still relevant gets edited at least once a year [01:17:38] i challenge you to find something useful that hasn't been [01:17:39] all the things I could find that hadn't been updated in a year were pretty bad, and actively misleading heh [01:18:11] we could just replace 3/4 of wikitech with "read the source, here's the links to our git repos" :) [01:18:15] well, deleting is never useful. automatically moving things to an Archive: namespace and excluding that from default search would be reasonable [01:18:23] except for hardware dc ops stuff, network stuff [01:18:48] legoktm: that's the wikipedian in you talking, and you should tell him to go back to enwiki or meta and leave your developer self alone :P [01:18:53] the BGP link on the front page ends up here: https://wikitech.wikimedia.org/w/index.php?title=BGP/old_setup&redirect=no [01:19:14] which talks about tampa, and us not having our own ASN and using a private one, etc [01:19:26] it's completely horribly the opposite of what it should say today heh [01:19:54] well, deleting is never useful. <-- i disagree! [01:20:38] https://wikitech.wikimedia.org/wiki/Proxy_access_to_cluster [01:20:42] ^ says to use fenari :) [01:21:12] and mentions labsconsole [01:21:30] at least you guys have been around long enough to be able to tell at a glance that this content is obsolete [01:21:39] the SSH config page has some issues with assuming everyone has root as well [01:21:53] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Reedy via Special:Random [01:22:00] imagine how disorienting it is to be new and have to navigate content that is so frequently totally incorrect [01:22:06] yeah [01:22:19] who has time to write docs, though :/ [01:22:34] I know, we should do it as a part of everything. it's sort of like writing tests. [01:22:54] what's the thing about lily.esams? [01:23:25] isn't the esams bastion hooft? [01:23:36] yes, it is [01:23:57] I don't know what lily is/was. fenari at least existed when I first arrived. I don't think I ever saw lily [01:24:17] apparently lily has previously hosted mailing lists [01:24:19] a markdown contenthandler should be pretty easy to do, now that pulling in external libraries to use in core is easy [01:24:26] (pre-jan12) [01:25:17] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1426112 (10Springle) Doing this to most production DBs seems straightforward. Pain points, due just to complexity, will be on M[1-4]. My expectation is that iptables won't be a performance bottleneck, but +1 to... [01:25:25] https://wikitech.wikimedia.org/wiki/Mail [01:25:35] why markdown? [01:26:08] because gerrit, phabricator and github [01:26:22] and mediawiki, as soon as we add the markdown content handler :P [01:26:29] recursive justification ftw [01:26:46] mchenry... sanger... lily [01:26:55] ori: anyways, https://wikitech.wikimedia.org/wiki/User:Legoktm/pywikibot_on_tools_lab hasn't been touched in over a year and is still accurate :) [01:27:19] ask Katie for a null-edit [01:27:42] gotta run, bye! [01:28:07] why do we have random pages like https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000698.eqiad.wmflabs? [01:28:41] Did esams used to be separate from the 'main cluster'? [01:29:20] depends what you mean by separate [01:30:03] https://wikitech.wikimedia.org/wiki/Proxy_access_to_cluster [01:30:45] esams needed a separate connection to the 'main cluster' (eqiad/pmtpa/sdtpa)?
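The keyholder-backed dsh invocation quoted at [00:59:12] can be wrapped into a small survey script to answer the earlier question about which PHP binary each app server uses, producing per-host output shaped like the salt results pasted at [00:41:42]. A minimal sketch follows, assuming a remote probe command and an added dsh -M flag (hostname prefix on each output line); neither is the exact check that was actually run.

    #!/usr/bin/env python
    # Sketch only: survey the PHP interpreter on every host in the
    # "mediawiki-installation" dsh group, reusing the keyholder agent socket
    # and dsh flags from the invocation quoted above. The remote probe is an
    # assumed command, not the exact check that was run over salt.
    import os
    import subprocess

    PROBE = "command -v php5 || command -v php"   # assumed probe command
    CMD = [
        "dsh", "-F", "20",                  # fan out, 20 hosts at a time
        "-g", "mediawiki-installation",     # same dsh group as in the log
        "-M",                               # prefix each output line with its hostname
        "-r", "ssh", "-o", "-oUser=mwdeploy",
        "--", PROBE,
    ]
    # the mwdeploy key lives behind the keyholder proxy, not a forwarded agent
    ENV = dict(os.environ, SSH_AUTH_SOCK="/run/keyholder/proxy.sock")

    output = subprocess.check_output(CMD, env=ENV).decode("utf-8", "replace")
    for line in output.splitlines():
        host, _, path = line.partition(":")
        print({host.strip(): path.strip()})   # e.g. {'tin.eqiad.wmnet': '/usr/bin/php5'}

This would be run from a deployment host such as tin, where the keyholder proxy socket exists; hosts outside the mediawiki-installation group (the snapshot hosts, per [00:42:43]) would still need to be checked separately.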
[01:39:28] PROBLEM - puppet last run on mw1069 is CRITICAL puppet fail [01:44:08] who has time to write docs, though :/ <---- Zanzu! [01:44:11] https://imgur.com/a/zanzu Zanzu! Documentation. Bug 1. [01:45:54] (I like to anthropomorphize/deify the random alphanumerical string that imgur gave that gallery. Zanzu is a capricious god. Zanzu is a howling storm. Zanzu will arrive next week.) [01:50:15] I wonder how yaseo/lopar fitted into https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [01:50:29] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:50:48] And I'm pretty sure that "We do not re-use hostnames of past servers on new servers." rule has been broken [01:53:00] Rules are made to be broken [01:53:53] https://wikitech.wikimedia.org/wiki/Capella [02:12:12] legoktm, sigh. a bunch of these old pmtpa pages were either left as is, or blanked [02:12:28] not properly deleted or marked as historical [02:12:33] or moved to obsolete namespace [02:18:19] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [02:29:02] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 09m 59s) [02:29:08] Logged the message, Master [02:31:28] PROBLEM - puppet last run on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:39] RECOVERY - HHVM rendering on mw1077 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.166 second response time [02:32:49] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [02:32:59] PROBLEM - HHVM rendering on mw1107 is CRITICAL - Socket timeout after 10 seconds [02:33:08] PROBLEM - HHVM rendering on mw1076 is CRITICAL - Socket timeout after 10 seconds [02:33:09] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:33:29] PROBLEM - Apache HTTP on mw1076 is CRITICAL - Socket timeout after 10 seconds [02:33:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [02:33:59] PROBLEM - dhclient process on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:59] PROBLEM - DPKG on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:00] PROBLEM - RAID on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:08] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [02:34:09] PROBLEM - puppet last run on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:39] RECOVERY - puppet last run on mw1077 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [02:34:48] PROBLEM - HHVM processes on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:48] PROBLEM - nutcracker process on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:49] PROBLEM - salt-minion processes on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:50] PROBLEM - HHVM processes on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:50] PROBLEM - puppet last run on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:58] PROBLEM - HHVM rendering on mw1110 is CRITICAL - Socket timeout after 10 seconds [02:34:58] PROBLEM - DPKG on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:35:19] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:19] PROBLEM - Apache HTTP on mw1110 is CRITICAL - Socket timeout after 10 seconds [02:35:19] PROBLEM - SSH on mw1110 is CRITICAL - Socket timeout after 10 seconds [02:35:39] RECOVERY - dhclient process on mw1076 is OK: PROCS OK: 0 processes with command name dhclient [02:35:39] RECOVERY - DPKG on mw1076 is OK: All packages OK [02:35:39] RECOVERY - RAID on mw1076 is OK no RAID installed [02:35:48] PROBLEM - RAID on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:50] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.125 second response time [02:35:50] RECOVERY - puppet last run on mw1076 is OK Puppet is currently enabled, last run 25 minutes ago with 0 failures [02:35:59] PROBLEM - Disk space on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:08] PROBLEM - nutcracker process on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:08] PROBLEM - dhclient process on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:09] PROBLEM - configured eth on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:28] RECOVERY - HHVM processes on mw1076 is OK: PROCS OK: 6 processes with command name hhvm [02:36:28] RECOVERY - nutcracker process on mw1076 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:36:38] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.315 second response time [02:36:39] RECOVERY - HHVM rendering on mw1076 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.585 second response time [02:36:49] PROBLEM - nutcracker port on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:59] RECOVERY - RAID on mw1107 is OK no RAID installed [02:37:00] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.090 second response time [02:39:41] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-04 02:39:41+00:00 [02:39:46] Logged the message, Master [02:39:49] RECOVERY - Disk space on mw1110 is OK: DISK OK [02:39:49] RECOVERY - nutcracker process on mw1110 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:39:49] RECOVERY - dhclient process on mw1110 is OK: PROCS OK: 0 processes with command name dhclient [02:39:50] RECOVERY - configured eth on mw1110 is OK - interfaces up [02:40:19] RECOVERY - salt-minion processes on mw1110 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:40:19] RECOVERY - HHVM processes on mw1110 is OK: PROCS OK: 6 processes with command name hhvm [02:40:19] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 16 minutes ago with 0 failures [02:40:29] RECOVERY - DPKG on mw1110 is OK: All packages OK [02:40:38] RECOVERY - nutcracker port on mw1110 is OK: TCP OK - 0.000 second response time on port 11212 [02:40:49] RECOVERY - SSH on mw1110 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:41:19] RECOVERY - RAID on mw1110 is OK no RAID installed [02:45:58] PROBLEM - puppet last run on mw1110 is CRITICAL puppet fail [02:48:48] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [03:11:24] !log Promoted Krinkle and Krenair to admin, cloudadmin on wikitech, because duh. 
[03:11:30] Logged the message, Master [03:18:37] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1426182 (10Bawolff) @Mike_Peel To do development stuff, sometimes we have to have files of various types that could be dangerous. For example, svgs that take over your account and do... [03:26:58] PROBLEM - HHVM rendering on mw1040 is CRITICAL - Socket timeout after 10 seconds [03:27:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [03:27:39] PROBLEM - Apache HTTP on mw1040 is CRITICAL - Socket timeout after 10 seconds [03:28:39] PROBLEM - RAID on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:49] PROBLEM - SSH on mw1040 is CRITICAL - Socket timeout after 10 seconds [03:28:49] PROBLEM - puppet last run on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:59] PROBLEM - nutcracker port on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:29:19] PROBLEM - DPKG on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:29:48] PROBLEM - configured eth on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:30:08] PROBLEM - Disk space on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:30:59] PROBLEM - dhclient process on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:31:49] RECOVERY - Disk space on mw1040 is OK: DISK OK [03:32:09] RECOVERY - RAID on mw1040 is OK no RAID installed [03:32:19] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.233 second response time [03:32:19] RECOVERY - SSH on mw1040 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [03:32:19] RECOVERY - puppet last run on mw1040 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [03:32:30] RECOVERY - nutcracker port on mw1040 is OK: TCP OK - 0.000 second response time on port 11212 [03:32:48] RECOVERY - dhclient process on mw1040 is OK: PROCS OK: 0 processes with command name dhclient [03:32:58] RECOVERY - DPKG on mw1040 is OK: All packages OK [03:33:00] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.219 second response time [03:33:20] RECOVERY - configured eth on mw1040 is OK - interfaces up [04:38:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [04:52:09] PROBLEM - RAID on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:59] PROBLEM - HHVM rendering on mw1069 is CRITICAL - Socket timeout after 10 seconds [04:53:40] PROBLEM - Apache HTTP on mw1069 is CRITICAL - Socket timeout after 10 seconds [04:53:59] PROBLEM - puppet last run on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:38] PROBLEM - DPKG on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:58] PROBLEM - Disk space on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - configured eth on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - salt-minion processes on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - HHVM processes on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:54:59] PROBLEM - nutcracker process on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - dhclient process on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:55:20] PROBLEM - SSH on mw1069 is CRITICAL - Socket timeout after 10 seconds [04:55:29] PROBLEM - nutcracker port on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:56:39] RECOVERY - Disk space on mw1069 is OK: DISK OK [04:56:40] RECOVERY - salt-minion processes on mw1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:56:40] RECOVERY - dhclient process on mw1069 is OK: PROCS OK: 0 processes with command name dhclient [04:56:40] RECOVERY - nutcracker process on mw1069 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:56:40] RECOVERY - HHVM processes on mw1069 is OK: PROCS OK: 6 processes with command name hhvm [04:56:40] RECOVERY - configured eth on mw1069 is OK - interfaces up [04:57:09] RECOVERY - SSH on mw1069 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [04:57:10] RECOVERY - nutcracker port on mw1069 is OK: TCP OK - 0.000 second response time on port 11212 [04:57:39] RECOVERY - RAID on mw1069 is OK no RAID installed [04:57:39] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:58:09] RECOVERY - DPKG on mw1069 is OK: All packages OK [05:01:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jul 4 05:01:43 UTC 2015 (duration 1m 42s) [05:01:47] Logged the message, Master [06:08:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [06:27:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [06:30:59] PROBLEM - puppet last run on mw1217 is CRITICAL Puppet has 1 failures [06:31:09] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail [06:33:29] PROBLEM - puppet last run on db1067 is CRITICAL Puppet has 1 failures [06:34:59] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures [06:35:28] PROBLEM - puppet last run on db2040 is CRITICAL Puppet has 2 failures [06:36:18] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures [06:37:59] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [06:37:59] PROBLEM - puppet last run on mw1065 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:38:49] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:39:49] PROBLEM - puppet last run on mw1114 is CRITICAL Puppet has 1 failures [06:46:18] RECOVERY - puppet last run on mw1217 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:28] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on db1067 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on db2040 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:59] RECOVERY - 
puppet last run on mw2184 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:39] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:48] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:49] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:49] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:58] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:19] RECOVERY - puppet last run on mw1114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:39] PROBLEM - puppet last run on mw1106 is CRITICAL Puppet has 5 failures [07:06:49] RECOVERY - puppet last run on mw1106 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:21:19] PROBLEM - Apache HTTP on mw1113 is CRITICAL - Socket timeout after 10 seconds [07:21:39] PROBLEM - HHVM rendering on mw1113 is CRITICAL - Socket timeout after 10 seconds [07:21:39] PROBLEM - DPKG on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:08] PROBLEM - nutcracker port on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:09] PROBLEM - puppet last run on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:29] PROBLEM - RAID on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:39] PROBLEM - SSH on mw1113 is CRITICAL - Socket timeout after 10 seconds [07:22:49] PROBLEM - HHVM processes on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:50] PROBLEM - configured eth on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:58] PROBLEM - nutcracker process on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:19] PROBLEM - salt-minion processes on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:50] PROBLEM - Disk space on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:50] PROBLEM - dhclient process on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:25:09] RECOVERY - salt-minion processes on mw1113 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:25:28] RECOVERY - DPKG on mw1113 is OK: All packages OK [07:25:38] RECOVERY - Disk space on mw1113 is OK: DISK OK [07:25:39] RECOVERY - dhclient process on mw1113 is OK: PROCS OK: 0 processes with command name dhclient [07:25:48] RECOVERY - nutcracker port on mw1113 is OK: TCP OK - 0.000 second response time on port 11212 [07:25:48] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 26 minutes ago with 0 failures [07:26:09] RECOVERY - RAID on mw1113 is OK no RAID installed [07:26:19] RECOVERY - SSH on mw1113 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [07:26:29] RECOVERY - HHVM processes on mw1113 is OK: PROCS OK: 6 processes with command name hhvm [07:26:30] RECOVERY - configured eth on mw1113 is OK - interfaces up [07:26:40] RECOVERY - nutcracker process on mw1113 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:31:59] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60579 bytes in 0.165 second response time [07:33:19] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 12 failures [07:56:48] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [08:00:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60558 bytes in 1.211 second response time [08:06:30] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:23:29] !log krinkle Synchronized php-1.26wmf12/resources/src/mediawiki/mediawiki.Title.js: I1dae1e63e47 (duration: 00m 17s) [08:49:19] PROBLEM - puppet last run on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:49:39] PROBLEM - dhclient process on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:49:58] PROBLEM - SSH on mw1045 is CRITICAL - Socket timeout after 10 seconds [08:50:08] PROBLEM - nutcracker process on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:28] PROBLEM - HHVM rendering on mw1045 is CRITICAL - Socket timeout after 10 seconds [08:50:49] PROBLEM - Apache HTTP on mw1045 is CRITICAL - Socket timeout after 10 seconds [08:51:58] PROBLEM - HHVM processes on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:58] PROBLEM - salt-minion processes on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:08] PROBLEM - configured eth on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:09] PROBLEM - nutcracker port on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:19] PROBLEM - RAID on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:38] PROBLEM - DPKG on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:08] PROBLEM - Disk space on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:56:59] RECOVERY - dhclient process on mw1045 is OK: PROCS OK: 0 processes with command name dhclient [08:57:28] RECOVERY - nutcracker process on mw1045 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:57:38] RECOVERY - nutcracker port on mw1045 is OK: TCP OK - 0.000 second response time on port 11212 [09:01:18] RECOVERY - HHVM processes on mw1045 is OK: PROCS OK: 1 process with command name hhvm [09:01:18] RECOVERY - salt-minion processes on mw1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:01:19] RECOVERY - configured eth on mw1045 is OK - interfaces up [09:01:39] RECOVERY - RAID on mw1045 is OK no RAID installed [09:01:49] RECOVERY - DPKG on mw1045 is OK: All packages OK [09:01:58] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.351 second response time [09:02:19] RECOVERY - Disk space on mw1045 is OK: DISK OK [09:02:49] RECOVERY - SSH on mw1045 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [09:03:28] RECOVERY - HHVM rendering on mw1045 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.142 second response time [09:17:39] PROBLEM - puppet last run on mw1030 is CRITICAL Puppet has 5 failures [09:24:28] RECOVERY - puppet last run on mw1045 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:28:08] 7Puppet, 6operations, 10Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1426362 (10Nemo_bis) [09:39:34] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1426371 (10Krinkle) >>! In T99096#1354157, @dduvall wrote: >>>! In T99096#1337418, @BBlack wrote: >>... [09:41:10] (03CR) 10Krinkle: Set $wgMainStash to redis instead of the DB default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [10:23:49] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1426381 (10Paladox) [10:24:38] PROBLEM - puppet last run on nescio is CRITICAL puppet fail [10:43:19] RECOVERY - puppet last run on nescio is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:59] PROBLEM - Apache HTTP on mw1028 is CRITICAL - Socket timeout after 10 seconds [10:46:09] PROBLEM - RAID on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:29] PROBLEM - HHVM rendering on mw1028 is CRITICAL - Socket timeout after 10 seconds [10:47:38] PROBLEM - configured eth on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:39] PROBLEM - nutcracker port on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:58] PROBLEM - dhclient process on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:48:40] PROBLEM - SSH on mw1028 is CRITICAL - Socket timeout after 10 seconds [10:48:48] PROBLEM - nutcracker process on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:48:48] PROBLEM - HHVM processes on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:09] PROBLEM - salt-minion processes on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:09] PROBLEM - puppet last run on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:49:10] PROBLEM - DPKG on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:49] PROBLEM - HHVM rendering on mw1030 is CRITICAL - Socket timeout after 10 seconds [10:53:00] PROBLEM - Apache HTTP on mw1030 is CRITICAL - Socket timeout after 10 seconds [10:53:09] PROBLEM - Disk space on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:19] RECOVERY - dhclient process on mw1028 is OK: PROCS OK: 0 processes with command name dhclient [10:53:59] PROBLEM - RAID on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:09] PROBLEM - dhclient process on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:09] RECOVERY - SSH on mw1028 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:54:10] RECOVERY - nutcracker process on mw1028 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [10:54:10] RECOVERY - HHVM processes on mw1028 is OK: PROCS OK: 6 processes with command name hhvm [10:54:19] PROBLEM - SSH on mw1030 is CRITICAL - Socket timeout after 10 seconds [10:54:29] RECOVERY - salt-minion processes on mw1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:54:29] RECOVERY - puppet last run on mw1028 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:54:29] PROBLEM - configured eth on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:30] RECOVERY - DPKG on mw1028 is OK: All packages OK [10:54:49] RECOVERY - configured eth on mw1028 is OK - interfaces up [10:54:50] RECOVERY - Disk space on mw1028 is OK: DISK OK [10:54:58] RECOVERY - nutcracker port on mw1028 is OK: TCP OK - 0.000 second response time on port 11212 [10:55:10] RECOVERY - RAID on mw1028 is OK no RAID installed [10:55:59] RECOVERY - dhclient process on mw1030 is OK: PROCS OK: 0 processes with command name dhclient [10:56:09] PROBLEM - DPKG on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:40] PROBLEM - HHVM processes on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:40] PROBLEM - nutcracker process on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:57:59] RECOVERY - DPKG on mw1030 is OK: All packages OK [10:57:59] RECOVERY - SSH on mw1030 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:58:08] RECOVERY - configured eth on mw1030 is OK - interfaces up [10:58:29] RECOVERY - HHVM processes on mw1030 is OK: PROCS OK: 6 processes with command name hhvm [10:58:29] RECOVERY - nutcracker process on mw1030 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [10:58:29] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:59:28] RECOVERY - RAID on mw1030 is OK no RAID installed [11:00:10] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.132 second response time [11:00:28] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [11:07:39] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [11:07:59] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:11:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:12:58] RECOVERY - puppet last run on mw1030 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:18:47] (03PS1) 10Merlijn van Deen: dynamicproxy: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [11:23:28] PROBLEM - puppet last run on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:24:38] PROBLEM - Apache HTTP on mw1070 is CRITICAL - Socket timeout after 10 seconds [11:24:38] PROBLEM - HHVM rendering on mw1070 is CRITICAL - Socket timeout after 10 seconds [11:25:09] PROBLEM - dhclient process on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:09] PROBLEM - nutcracker process on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:48] PROBLEM - salt-minion processes on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:49] PROBLEM - Disk space on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:49] PROBLEM - configured eth on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:49] PROBLEM - SSH on mw1070 is CRITICAL - Socket timeout after 10 seconds [11:26:08] PROBLEM - DPKG on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:19] PROBLEM - nutcracker port on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:19] PROBLEM - HHVM processes on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:29] PROBLEM - RAID on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:28:49] RECOVERY - dhclient process on mw1070 is OK: PROCS OK: 0 processes with command name dhclient [11:28:49] RECOVERY - nutcracker process on mw1070 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:31:08] RECOVERY - salt-minion processes on mw1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:31:09] RECOVERY - Disk space on mw1070 is OK: DISK OK [11:31:09] RECOVERY - configured eth on mw1070 is OK - interfaces up [11:31:18] RECOVERY - SSH on mw1070 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:31:28] RECOVERY - DPKG on mw1070 is OK: All packages OK [11:31:39] RECOVERY - nutcracker port on mw1070 is OK: TCP OK - 0.000 second response time on port 11212 [11:31:39] RECOVERY - HHVM processes on mw1070 is OK: PROCS OK: 6 processes with command name hhvm [11:31:49] RECOVERY - RAID on mw1070 is OK no RAID installed [11:39:49] RECOVERY - puppet last run on mw1070 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:46:38] PROBLEM - HHVM rendering on mw1106 is CRITICAL - Socket timeout after 10 seconds [11:48:08] PROBLEM - Apache HTTP on mw1106 is CRITICAL - Socket timeout after 10 seconds [11:48:29] PROBLEM - Disk space on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:49] PROBLEM - SSH on mw1106 is CRITICAL - Socket timeout after 10 seconds [11:48:50] PROBLEM - DPKG on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:08] PROBLEM - RAID on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:18] PROBLEM - salt-minion processes on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:18] PROBLEM - HHVM processes on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:19] PROBLEM - puppet last run on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:29] PROBLEM - nutcracker port on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:50] PROBLEM - dhclient process on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:50] PROBLEM - nutcracker process on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:58] PROBLEM - configured eth on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:57:48] RECOVERY - Disk space on mw1106 is OK: DISK OK [11:58:09] RECOVERY - DPKG on mw1106 is OK: All packages OK [11:58:09] RECOVERY - SSH on mw1106 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:58:18] RECOVERY - RAID on mw1106 is OK no RAID installed [11:58:19] RECOVERY - HHVM processes on mw1106 is OK: PROCS OK: 6 processes with command name hhvm [11:58:19] RECOVERY - salt-minion processes on mw1106 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:58:29] RECOVERY - nutcracker port on mw1106 is OK: TCP OK - 0.000 second response time on port 11212 [11:58:59] RECOVERY - nutcracker process on mw1106 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:58:59] RECOVERY - dhclient process on mw1106 is OK: PROCS OK: 0 processes with command name dhclient [11:59:00] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.227 second response time [11:59:08] RECOVERY - configured eth on mw1106 is OK - interfaces up [11:59:29] RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.141 second response time [12:05:59] RECOVERY - puppet last run on mw1106 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:19] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds [12:19:09] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [12:31:38] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [12:39:29] PROBLEM - puppet last run on mw1081 is CRITICAL Puppet has 18 failures [12:40:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [12:44:18] PROBLEM - puppet last run on mw2099 is CRITICAL puppet fail [12:46:44] (03PS1) 10Nemo bis: [Planet Wikimedia] Add Josve05a to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/222761 [12:47:27] mutante: some nice freedom of panorama stuff, +2 pls :) https://gerrit.wikimedia.org/r/#/c/222761/ [12:49:59] PROBLEM - puppet last run on lvs2004 is CRITICAL puppet fail [12:51:09] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [12:51:09] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [12:51:19] (03PS2) 10Yuvipanda: planet: Add Josve05a to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/222761 (owner: 10Nemo bis) [12:51:27] cassandra on restbase1001 & restbase1005 are down, can someone restart? [12:51:27] (03CR) 10Yuvipanda: [C: 032 V: 032] planet: Add Josve05a to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/222761 (owner: 10Nemo bis) [12:51:45] thanks! [12:52:38] urandom: yeah, doing now. service cassandra restart? [12:53:07] yeah [12:53:08] !log restarted cassandra on restbase1001 and 1005 [12:53:10] done [12:53:29] YuviPanda: thanks [12:53:34] yw [12:53:39] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [12:54:49] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [12:55:06] legoktm: > To use Connected Apps on this site, you must have an account across all projects. 
When you have an account on all projects, you can try to connect "SQL Quarry Local Test Instance" again. [12:55:08] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.002 second response time on port 9042 [12:55:09] I thought you had fixed that. [12:55:11] :( [12:56:15] !log restart nutcracker on silver [12:56:39] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.000 second response time on port 9042 [12:59:39] PROBLEM - puppet last run on mw2153 is CRITICAL Puppet has 1 failures [13:01:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [13:02:49] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:06:39] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:12:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:12:49] RECOVERY - puppet last run on mw1081 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:14:19] RECOVERY - puppet last run on mw2153 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:38:49] PROBLEM - HHVM rendering on mw1112 is CRITICAL - Socket timeout after 10 seconds [14:40:20] PROBLEM - Apache HTTP on mw1112 is CRITICAL - Socket timeout after 10 seconds [14:40:40] PROBLEM - HHVM processes on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:40:40] PROBLEM - Disk space on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:40] PROBLEM - nutcracker process on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:40] PROBLEM - configured eth on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:49] PROBLEM - salt-minion processes on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:49] PROBLEM - RAID on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:59] PROBLEM - dhclient process on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:08] PROBLEM - SSH on mw1112 is CRITICAL - Socket timeout after 10 seconds [14:42:10] PROBLEM - nutcracker port on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:18] PROBLEM - puppet last run on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:18] PROBLEM - DPKG on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:43:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [14:46:18] RECOVERY - Disk space on mw1112 is OK: DISK OK [14:46:20] RECOVERY - HHVM processes on mw1112 is OK: PROCS OK: 6 processes with command name hhvm [14:47:09] RECOVERY - nutcracker process on mw1112 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:47:09] RECOVERY - configured eth on mw1112 is OK - interfaces up [14:47:19] RECOVERY - salt-minion processes on mw1112 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:47:28] RECOVERY - RAID on mw1112 is OK no RAID installed [14:47:28] RECOVERY - dhclient process on mw1112 is OK: PROCS OK: 0 processes with command name dhclient [14:47:38] RECOVERY - SSH on mw1112 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:47:39] RECOVERY - nutcracker port on mw1112 is OK: TCP OK - 0.000 second response time on port 11212 [14:47:48] RECOVERY - puppet last run on mw1112 is OK Puppet is currently enabled, last run 30 minutes ago with 0 failures [14:47:48] RECOVERY - DPKG on mw1112 is OK: All packages OK [14:48:17] mw1112 got up on its own [14:49:11] the OOM killer killed hhvm [14:49:28] PROBLEM - HHVM rendering on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [14:49:39] PROBLEM - Apache HTTP on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.353 second response time [14:49:51] PROBLEM - puppet last run on mw1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:49:58] PROBLEM - RAID on mw1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:51:18] RECOVERY - HHVM rendering on mw1062 is OK: HTTP OK: HTTP/1.1 200 OK - 65533 bytes in 0.377 second response time [14:51:29] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.206 second response time [14:51:30] RECOVERY - puppet last run on mw1062 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:51:40] RECOVERY - RAID on mw1062 is OK no RAID installed [14:56:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:04:47] 6operations, 7HHVM: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1426505 (10faidon) 3NEW [15:13:58] PROBLEM - Apache HTTP on mw1057 is CRITICAL - Socket timeout after 10 seconds [15:14:08] PROBLEM - HHVM rendering on mw1057 is CRITICAL - Socket timeout after 10 seconds [15:21:48] PROBLEM - puppet last run on mw1057 is CRITICAL Puppet has 24 failures [15:27:11] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1426525 (10jcrespo) M[1-5] :-) And it is definitely a We! We could standardize on 4444, which is not assigned according to `/etc/services` and the one Galera uses for xtrabackup SSTs http://galeracluster.com/doc... [15:38:30] PROBLEM - puppet last run on mw1032 is CRITICAL Puppet has 68 failures [15:40:17] <_joe_> again almost OOMing ^^ [15:42:06] <_joe_> a lot of appservers are in that situation btw [15:47:14] 6operations, 7HHVM: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1426544 (10Joe) For reference, the problem doesn't seem to present itself on the API appservers. 
On mw1117, HHVM is running since june 25 but just using up 20% of the memory [15:53:19] RECOVERY - puppet last run on mw1032 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:03:09] PROBLEM - Apache HTTP on mw1083 is CRITICAL - Socket timeout after 10 seconds [16:03:18] PROBLEM - RAID on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:03:59] PROBLEM - HHVM rendering on mw1083 is CRITICAL - Socket timeout after 10 seconds [16:04:39] PROBLEM - puppet last run on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:08] PROBLEM - salt-minion processes on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:09] PROBLEM - nutcracker port on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:10] PROBLEM - dhclient process on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:18] PROBLEM - SSH on mw1083 is CRITICAL - Socket timeout after 10 seconds [16:05:38] PROBLEM - Disk space on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:49] PROBLEM - HHVM processes on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:19] PROBLEM - nutcracker process on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:19] PROBLEM - DPKG on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:39] PROBLEM - configured eth on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:59] RECOVERY - Disk space on mw1083 is OK: DISK OK [16:11:09] RECOVERY - HHVM processes on mw1083 is OK: PROCS OK: 6 processes with command name hhvm [16:11:48] RECOVERY - nutcracker process on mw1083 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:12:16] !log krenair Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 10m 35s) [16:12:18] RECOVERY - salt-minion processes on mw1083 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:12:19] RECOVERY - dhclient process on mw1083 is OK: PROCS OK: 0 processes with command name dhclient [16:12:19] RECOVERY - nutcracker port on mw1083 is OK: TCP OK - 0.000 second response time on port 11212 [16:12:19] RECOVERY - RAID on mw1083 is OK no RAID installed [16:12:20] Logged the message, Master [16:12:29] RECOVERY - SSH on mw1083 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:12:59] RECOVERY - HHVM rendering on mw1083 is OK: HTTP OK: HTTP/1.1 200 OK - 65533 bytes in 0.226 second response time [16:13:38] RECOVERY - DPKG on mw1083 is OK: All packages OK [16:13:48] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 13 minutes ago with 0 failures [16:13:49] RECOVERY - configured eth on mw1083 is OK - interfaces up [16:14:00] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [16:40:51] (03PS2) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [16:40:56] YuviPanda: ^ [16:41:46] (03PS3) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [16:43:52] (03CR) 10Yuvipanda: dynamicproxy/tools: set up outage error system (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [16:43:56] 
valhallasw`cloud: whee in general. [16:46:13] OH GOD SO MANY COMMENTS [16:46:14] :D [16:49:26] (03CR) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [16:49:31] YuviPanda: nfs might be an issue, yeah [16:51:27] YuviPanda: I'm now actually worried puppet might not even run correctly when stuff is broken [16:51:46] there's lots of other stuff in puppet that's applied by default running anyway [16:51:54] there's a gridengine install on the proxies. [16:52:28] YuviPanda: yeah, I know. But if puppet doesn't run because NFS is down, we also can't apply changes to the nginx conf [16:52:42] right [16:52:42] doesn't run --> crashes/hangs [16:52:52] and the minute it hits an nfs file it's going to just hang. [16:53:01] yep [16:53:26] but if we restart the instance it'll come back up without NFS [16:53:29] and then puppet will run [16:53:29] it might be OK though, because if nfs is down, it'll just re-clone stuff on / [16:53:30] so it's not that bad [16:53:44] and we can work towards getting rid of NFS from tools instances that don't need them [16:54:03] yeah, setting up /data/project/admin shouldn't really be in proxy anyway [16:54:15] yeah [16:54:22] oh well [16:55:08] I can set the file{} to only copy if the file doesn't exist, I guess? [16:55:18] PROBLEM - HHVM rendering on mw1074 is CRITICAL - Socket timeout after 10 seconds [16:55:35] then it doesn't need to wait for the git clone... except if it randomly still waits for it because puppet is single-threaded [16:55:36] ugh [16:55:58] PROBLEM - Apache HTTP on mw1074 is CRITICAL - Socket timeout after 10 seconds [16:56:11] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1426579 (10BBlack) [16:56:55] YuviPanda: I'll change the code to insert the override in comments instead of not inserting it [16:56:59] PROBLEM - dhclient process on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:04] so that if puppet doesn't work it's still easy to apply [16:57:09] ok [16:57:18] PROBLEM - nutcracker process on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:18] PROBLEM - SSH on mw1074 is CRITICAL - Socket timeout after 10 seconds [16:57:19] PROBLEM - HHVM processes on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:29] PROBLEM - DPKG on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:49] PROBLEM - RAID on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:19] PROBLEM - puppet last run on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:58:49] RECOVERY - dhclient process on mw1074 is OK: PROCS OK: 0 processes with command name dhclient [16:58:59] RECOVERY - nutcracker process on mw1074 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:58:59] RECOVERY - SSH on mw1074 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:59:00] RECOVERY - HHVM processes on mw1074 is OK: PROCS OK: 6 processes with command name hhvm [16:59:18] RECOVERY - DPKG on mw1074 is OK: All packages OK [16:59:38] RECOVERY - RAID on mw1074 is OK no RAID installed [17:07:20] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [17:07:39] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [17:15:00] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [17:15:00] <_joe_> !log restarted cassandra on restbase1001 [17:15:05] Logged the message, Master [17:16:29] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.017 second response time on port 9042 [17:34:33] (03PS1) 10Alex Monk: wikitech: Clean up contentadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222776 [17:34:53] YuviPanda, ^ [17:35:30] (03PS4) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [17:35:43] YuviPanda: ^ added notes in the config file. will write some docs as well [17:36:12] wait [17:36:14] ('nuke' allows you to delete all pages created by a user, as a convenience addition to 'delete', which is something Krinkle ran into yesterday) [17:39:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [17:43:25] YuviPanda: + https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin/emergency_guides/labs_down_notification [17:44:05] hmm, I dunno if the commenting is going to be that useful - if puppet isn't running you probably can't ssh in either, but still better to have [17:44:07] also yay :) [17:48:15] YuviPanda: yeah, not sure either, but puppet might also fail for other reasons (say, wikitech down) [17:48:23] yeah, fair enough [17:54:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:03:57] valhallasw`cloud: eugh, the quarry code in general seems kind of shit >_> [18:05:03] YuviPanda: :( [18:05:16] valhallasw`cloud: it isn't *too* bad, but it could be much better. [18:05:19] the organization is kind of shit. [18:06:37] meh, that's what you get with code that grows [18:08:15] valhallasw`cloud: probably, although this could've started better. [18:08:59] valhallasw`cloud: I should look into blueprints, etc and structure these a bit better [18:09:09] having the entire app be in one giant app.py file isn't that great either probably [18:10:35] hmm, the 'api' calls can probably be split out first [18:11:06] (03PS1) 10Alex Monk: wikitech: Get rid of unused ops namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 [18:12:49] (03CR) 10Alex Monk: "And just to be sure:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 (owner: 10Alex Monk) [18:14:53] valhallasw`cloud: do you think applications should be factored by functionality (query, user, login, etc) or by some other factor? (Api / login / html)? 
[18:14:56] hmm, the former makes more sense [18:15:27] except stuff like 'query' might become really big [18:15:30] (03CR) 10Glaisher: "https://wikitech.wikimedia.org/wiki/Ops:Templateeditor_right_for_enwiki Shouldn't something be done with this then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 (owner: 10Alex Monk) [18:17:15] (03CR) 10Alex Monk: "Yes, namespaceDupes would move it into the main namespace with an 'Ops:' pretend-namespace prefix." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 (owner: 10Alex Monk) [18:17:34] YuviPanda: you should have used django =p [18:17:44] heh [18:17:47] it might not be too late [18:17:54] (not really, but the structure might be useful for quarry as well) [18:18:00] :P [18:18:04] the admin bit might be yeah [18:19:03] so I'm thinking of splitting out login first, and then query (just deals with individual queries, running and displaying them) and then query-lists [18:19:15] yeah, that sounds sane [18:19:29] SQLAlchemy is kind of a bitch... [18:19:47] hm, it really is a bit of a CRUDdy app, in that sense [18:19:59] it's very very CRUDdy [18:20:01] well, no D [18:20:05] yeah [18:20:20] maybe you should reconsider django then [18:20:48] PROBLEM - HHVM rendering on mw1086 is CRITICAL - Socket timeout after 10 seconds [18:20:49] PROBLEM - Apache HTTP on mw1086 is CRITICAL - Socket timeout after 10 seconds [18:21:11] valhallasw`cloud: yeah, and I guess I can integrate django into celery easily [18:21:50] PROBLEM - nutcracker process on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:59] PROBLEM - Disk space on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:19] PROBLEM - puppet last run on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:49] PROBLEM - nutcracker port on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:57] valhallasw`cloud: although, I have to balance redoing this vs helping halfak with ORES [18:22:59] PROBLEM - salt-minion processes on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:59] PROBLEM - RAID on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:59] PROBLEM - configured eth on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:09] PROBLEM - Disk space on labstore2001 is CRITICAL: DISK CRITICAL - free space: /srv/backup-others-20150703 135031 MB (3% inode=98%) [18:23:10] PROBLEM - SSH on mw1086 is CRITICAL - Socket timeout after 10 seconds [18:23:10] PROBLEM - DPKG on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:19] PROBLEM - HHVM processes on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:39] PROBLEM - dhclient process on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:25] Is that the same server? Or a different one this time? [18:25:29] Why do random mw servers keep doing this? [18:26:57] Krenair: it's because hhvm is too stable... [18:27:00] hm, apparently that host randomly caused sync to get stuck a few days ago, but otherwise 
nothing I could find in my logs for this channel [18:27:01] heh [18:27:20] quite literally - it's a memleak that makes hosts go OOM [18:27:29] fun [18:27:32] but because HHVM used to crash repeatedly it never became that bad before [18:27:39] unfortunately HHVM doesn't crash as much [18:29:59] RECOVERY - nutcracker port on mw1086 is OK: TCP OK - 0.000 second response time on port 11212 [18:30:10] RECOVERY - salt-minion processes on mw1086 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:30:10] RECOVERY - configured eth on mw1086 is OK - interfaces up [18:30:10] RECOVERY - RAID on mw1086 is OK no RAID installed [18:30:12] great... :| [18:30:19] RECOVERY - SSH on mw1086 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:30:19] RECOVERY - DPKG on mw1086 is OK: All packages OK [18:30:29] RECOVERY - HHVM processes on mw1086 is OK: PROCS OK: 6 processes with command name hhvm [18:30:48] RECOVERY - dhclient process on mw1086 is OK: PROCS OK: 0 processes with command name dhclient [18:30:49] RECOVERY - nutcracker process on mw1086 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:30:59] RECOVERY - Disk space on mw1086 is OK: DISK OK [18:33:29] PROBLEM - puppet last run on mw1084 is CRITICAL Puppet has 1 failures [18:37:12] valhallasw`cloud: I guess I could try by making a simple django mw oauth thing and see how I feel... [18:38:39] RECOVERY - puppet last run on mw1086 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:39:10] YuviPanda: yeah. hard to say what the best approach is [18:39:38] valhallasw`cloud: yeah. I also haven't written any Django in 5 years... [18:39:53] :D it's just like python except with layers and layers of indirection [18:40:05] valhallasw`cloud: we could add mw support to https://github.com/omab/python-social-auth! [18:40:59] PROBLEM - Apache HTTP on mw1058 is CRITICAL - Socket timeout after 10 seconds [18:41:09] PROBLEM - HHVM rendering on mw1058 is CRITICAL - Socket timeout after 10 seconds [18:41:40] YuviPanda: oooh. [18:42:39] PROBLEM - DPKG on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:48] PROBLEM - Disk space on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:50] PROBLEM - HHVM rendering on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:43:19] PROBLEM - nutcracker port on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:20] PROBLEM - configured eth on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:28] PROBLEM - RAID on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:29] PROBLEM - Apache HTTP on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:43:29] PROBLEM - puppet last run on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:39] PROBLEM - nutcracker port on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:39] PROBLEM - dhclient process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:48] PROBLEM - puppet last run on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:49] PROBLEM - dhclient process on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:49] PROBLEM - HHVM processes on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
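At 18:40 above, valhallasw`cloud suggests adding MediaWiki support to python-social-auth. A rough, unverified sketch of what such a backend might look like, following that library's BaseOAuth1 layout; the MediaWiki endpoint URLs, the identify/JWT handling, and the field mapping are assumptions to be checked against the Extension:OAuth docs, not code from either project.

from social.backends.oauth import BaseOAuth1


class MediaWikiOAuth(BaseOAuth1):
    """Hypothetical MediaWiki OAuth 1.0a backend (illustrative sketch only)."""
    name = 'mediawiki'

    # Assumed endpoints on meta.wikimedia.org; the real URLs should be taken
    # from the MediaWiki OAuth extension documentation.
    MEDIAWIKI_BASE = 'https://meta.wikimedia.org/w/index.php'
    REQUEST_TOKEN_URL = MEDIAWIKI_BASE + '?title=Special:OAuth/initiate'
    AUTHORIZATION_URL = MEDIAWIKI_BASE + '?title=Special:OAuth/authorize'
    ACCESS_TOKEN_URL = MEDIAWIKI_BASE + '?title=Special:OAuth/token'

    def user_data(self, access_token, *args, **kwargs):
        # A real backend would call Special:OAuth/identify here and verify the
        # signed JWT it returns; this placeholder returns nothing useful.
        return {}

    def get_user_details(self, response):
        # Map whatever the identify step returned onto the fields
        # python-social-auth expects.
        return {
            'username': response.get('username', ''),
            'email': response.get('email', ''),
        }

The backend would then be enabled by adding its dotted path to the project's AUTHENTICATION_BACKENDS / SOCIAL_AUTH settings, as with any other python-social-auth backend.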
[18:43:49] PROBLEM - SSH on mw1058 is CRITICAL - Socket timeout after 10 seconds [18:44:09] PROBLEM - nutcracker process on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:09] PROBLEM - salt-minion processes on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:19] PROBLEM - salt-minion processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:19] PROBLEM - DPKG on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:39] PROBLEM - RAID on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:49] PROBLEM - configured eth on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:49] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:49] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:59] PROBLEM - Disk space on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:59] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:46:27] (03PS1) 10BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/222839 [18:48:08] YuviPanda: I'm going to poke Coren next week to take a look at the DNS stuff for the mailrelay, as he has more experience with spf etc [18:48:18] RECOVERY - Disk space on mw1058 is OK: DISK OK [18:48:23] +1, I don't even know what SPF stands for [18:48:28] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [18:48:28] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:48:28] RECOVERY - configured eth on mw1075 is OK - interfaces up [18:48:29] RECOVERY - Disk space on mw1075 is OK: DISK OK [18:48:29] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:48:59] RECOVERY - nutcracker port on mw1075 is OK: TCP OK - 0.000 second response time on port 11212 [18:49:08] RECOVERY - dhclient process on mw1075 is OK: PROCS OK: 0 processes with command name dhclient [18:49:18] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:49:23] (03Abandoned) 10Merlijn van Deen: [tools] Add user keys for Tools roots [puppet] - 10https://gerrit.wikimedia.org/r/222298 (owner: 10Merlijn van Deen) [18:49:38] RECOVERY - nutcracker process on mw1058 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:49:38] RECOVERY - salt-minion processes on mw1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:49:49] RECOVERY - salt-minion processes on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:49:51] RECOVERY - DPKG on mw1075 is OK: All packages OK [18:49:51] RECOVERY - DPKG on mw1058 is OK: All packages OK [18:50:29] RECOVERY - nutcracker port on mw1058 is OK: TCP OK - 0.000 second response time on port 11212 [18:50:29] RECOVERY - configured eth on mw1058 is OK - interfaces up [18:50:39] RECOVERY - RAID on mw1058 is OK no RAID installed [18:50:58] RECOVERY - HHVM processes on mw1058 is OK: PROCS OK: 6 processes with command name hhvm [18:50:59] RECOVERY - dhclient process on mw1058 is OK: PROCS OK: 0 processes with command name dhclient [18:50:59] RECOVERY - SSH on mw1058 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:51:49] RECOVERY - puppet last run on mw1084 is OK 
Puppet is currently enabled, last run 1 minute ago with 0 failures [18:53:34] valhallasw`cloud: I think I won't have time to move it completely to django, unfortunately. Also this could just as well be a learning experience :P I'm going to factor it into a bunch of flask blueprints [18:53:59] PROBLEM - configured eth on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:10] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:54:49] PROBLEM - puppet last run on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:19] PROBLEM - DPKG on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:49] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:49] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:59] PROBLEM - Disk space on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:56:29] PROBLEM - nutcracker port on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:09] PROBLEM - salt-minion processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:28] PROBLEM - dhclient process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:30] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [19:01:30] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:01:39] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:03:58] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [19:04:53] valhallasw`cloud: how does https://gerrit.wikimedia.org/r/222841 look? [19:05:28] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [19:06:19] (03PS1) 10BBlack: tlsproxy: add negotiated cipher to conn props [puppet] - 10https://gerrit.wikimedia.org/r/222842 [19:06:59] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:59] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
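Around 18:14–18:53 above, YuviPanda talks about factoring Quarry's single app.py into Flask blueprints (login, query, query-lists). A minimal sketch of what that split could look like; the blueprint names, routes, and the application factory here are illustrative assumptions, not Quarry's actual code.

from flask import Blueprint, Flask, jsonify

# "query" blueprint: deals with individual queries (running and displaying them).
query = Blueprint('query', __name__, url_prefix='/query')


@query.route('/<int:query_id>')
def show_query(query_id):
    # Placeholder view; the real one would render a template and hit the DB.
    return jsonify({'id': query_id})


# "login" blueprint: would hold the MediaWiki OAuth handshake views.
login = Blueprint('login', __name__)


@login.route('/login')
def start_login():
    return 'the MediaWiki OAuth redirect would go here'


def create_app():
    # Small application factory instead of one giant app.py: each functional
    # area lives in its own module and just gets registered here.
    app = Flask(__name__)
    app.register_blueprint(query)
    app.register_blueprint(login)
    return app


if __name__ == '__main__':
    create_app().run(debug=True)

Splitting by functionality (login, query, query-lists) rather than by transport (API vs. HTML) matches the approach discussed above: each blueprint can grow its own templates and API routes without app.py having to know about them.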
[19:07:09] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [19:07:39] RECOVERY - nutcracker port on mw1075 is OK: TCP OK - 0.000 second response time on port 11212 [19:07:39] RECOVERY - dhclient process on mw1075 is OK: PROCS OK: 0 processes with command name dhclient [19:07:48] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 47 minutes ago with 0 failures [19:08:09] RECOVERY - salt-minion processes on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:08:10] RECOVERY - DPKG on mw1075 is OK: All packages OK [19:08:39] RECOVERY - RAID on mw1075 is OK no RAID installed [19:08:40] RECOVERY - configured eth on mw1075 is OK - interfaces up [19:08:48] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [19:08:48] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:08:49] RECOVERY - Disk space on mw1075 is OK: DISK OK [19:08:49] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:14:59] PROBLEM - puppet last run on mw1075 is CRITICAL puppet fail [19:15:02] YuviPanda, restbase1001 may need a restart it looks like [19:15:23] subbu: alright. cassandra or restbase? [19:15:47] i think cassandra? [19:15:49] cassandra it looks like [19:15:50] yeah [19:15:55] !log restarted cassandra on restbase1001 [19:16:00] Logged the message, Master [19:16:12] you still in UK or back in SF? [19:16:20] subbu: still in the UK :) [19:16:30] ah, wikimania plans? [19:16:39] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [19:16:51] subbu: yes! next weekend, I'm going to NYC, and then from there to Mexico, and then from there to SF :) [19:17:22] wow, ok .. so, sf -> lyon -> uk -> nyc -> mexico -> sf? [19:17:33] or did i miss a few stops in between? :) [19:18:09] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.001 second response time on port 9042 [19:18:21] subbu: no, it was OAK -> JFK -> Lyon -> Glasgow -> Birmingham (questionable?!) -> Manchester -> JFK -> Mexico -> OAK [19:18:43] ok :) [19:18:47] subbu: am currently planning next round of travel in August, for CCC Camp. [19:18:55] happens once in 4 years! and my GSoC student is going! :) [19:19:03] YuviPanda: oh? going to CCC Camp? [19:19:06] some day, .. i'll get there. [19:19:08] paravoid: yes! [19:19:18] paravoid: haven't booked tickets yet, but I am. [19:19:22] how come? [19:19:24] i've always wanted to check it out .. but your travel plans are making me dizzy. [19:19:48] paravoid: happens once every 4 years, and several people I was involved with in the himalayan hackerspace are coming too [19:20:16] ah [19:20:40] * paravoid throws https://wiki.debconf.org/wiki/DebConf15/CCCampTransport out there [19:20:45] paravoid: plus it's a good excuse to meet _joe_ for a bit [19:21:14] invalid certificate authority? [19:21:33] ignore that, SPI CA [19:21:40] paravoid: haha, paul proteus is going as well. [19:21:50] well of course :) [19:21:59] I considered doing CCC Camp too, but too many trips this year [19:22:03] it's like 3/4 separate social circles intersecting in weird ways (my gf / people in her circles are going to show up as well) [19:22:07] I'll probably go to congress in dec, though [19:22:35] paravoid: nice! 
this page is for the CCC -> DebConf15 bus [19:22:47] paravoid: so this also means I get only 20 days in SF after I go back this time... [19:22:56] just saying :P [19:22:59] :P [19:23:02] oh [19:23:06] I thought it was DebConf -> CCC [19:23:23] no, debconf is after CCC [19:23:26] right [19:23:32] I think Moritz is going to DebConf [19:23:36] yes, me too [19:23:39] ooooh, nice [19:23:48] why is Europe so cool yet stingy about letting me stay? brrrrr [19:24:04] but I have a 1 year visa now, MUST MAKE USE! [19:24:13] so coming to debconf then? [19:24:19] I'm super tempted [19:24:23] hahaha [19:24:42] at this point I'm basically like 'fuck you visas, I have you now, I shall go to all the conferences' [19:24:44] :P [19:25:25] paravoid: I might actually come, I'll talk with the others and see what they're up to afterwards [19:25:25] yeah I was reading http://lists.debconf.org/lurker/message/20150702.181154.2b5cffef.en.html and thinking of you [19:25:35] http://lists.debconf.org/lurker/message/20150703.025850.94e422ca.en.html etc. [19:25:50] paravoid: yeah [19:26:01] paravoid: however, I don't have to go through that! because if you apply from the *US*, they don't give a shit... [19:26:06] yeah yeah [19:26:15] I've probably ranted enough to bleed everyone's ears out [19:26:20] :P [19:26:31] you missed debconf's sponsorship deadline, though -- you'd have to pay... $30something/day for accommodation + food [19:26:46] that's ok. cheaper than living in SF... [19:26:48] if there is space [19:26:50] right. [19:26:57] and if nothing is happening post Camp... [19:27:21] it is really tempting, however. [19:27:42] :P [19:43:47] valhallasw`cloud: I'm going to go merge these anyway :P [19:49:59] YuviPanda: I'm watching a movie, sorry [19:50:07] valhallasw`cloud: yeah, 'tis ok :) [19:50:11] valhallasw`cloud: they ended up being fine. [19:50:17] I'll do more splits later. [19:50:21] valhallasw`cloud: enjoy your movie! [19:55:48] PROBLEM - Apache HTTP on mw1022 is CRITICAL - Socket timeout after 10 seconds [19:56:38] PROBLEM - HHVM rendering on mw1022 is CRITICAL - Socket timeout after 10 seconds [19:57:09] PROBLEM - puppet last run on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:18] PROBLEM - nutcracker port on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - salt-minion processes on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - configured eth on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - dhclient process on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - SSH on mw1022 is CRITICAL - Socket timeout after 10 seconds [19:57:49] PROBLEM - DPKG on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:49] PROBLEM - Disk space on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:49] PROBLEM - RAID on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:59] PROBLEM - HHVM processes on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:58:59] RECOVERY - nutcracker port on mw1022 is OK: TCP OK - 0.000 second response time on port 11212 [19:58:59] RECOVERY - salt-minion processes on mw1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:59:08] RECOVERY - configured eth on mw1022 is OK - interfaces up [19:59:08] RECOVERY - dhclient process on mw1022 is OK: PROCS OK: 0 processes with command name dhclient [19:59:08] RECOVERY - SSH on mw1022 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:59:28] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.417 second response time [19:59:29] RECOVERY - Disk space on mw1022 is OK: DISK OK [19:59:29] RECOVERY - RAID on mw1022 is OK no RAID installed [19:59:29] RECOVERY - DPKG on mw1022 is OK: All packages OK [20:00:19] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 65611 bytes in 0.184 second response time [20:00:49] RECOVERY - HHVM processes on mw1022 is OK: PROCS OK: 6 processes with command name hhvm [20:00:50] RECOVERY - puppet last run on mw1022 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [20:18:12] (03PS2) 10BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/222839 [20:18:14] (03PS1) 10BBlack: update "strong" desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/222851 [20:19:43] (03CR) 10BBlack: [C: 032] update "strong" desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/222851 (owner: 10BBlack) [20:21:00] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1426759 (10BBlack) [20:38:32] (03PS3) 10BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/222839 [20:50:29] PROBLEM - puppet last run on mw1070 is CRITICAL puppet fail [20:51:00] PROBLEM - Apache HTTP on mw1054 is CRITICAL - Socket timeout after 10 seconds [20:51:39] PROBLEM - HHVM rendering on mw1054 is CRITICAL - Socket timeout after 10 seconds [20:52:18] PROBLEM - nutcracker port on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:19] PROBLEM - salt-minion processes on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:38] PROBLEM - SSH on mw1054 is CRITICAL - Socket timeout after 10 seconds [20:52:50] PROBLEM - DPKG on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:09] PROBLEM - configured eth on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:38] PROBLEM - puppet last run on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:39] PROBLEM - RAID on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:58] PROBLEM - nutcracker process on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:58:29] PROBLEM - dhclient process on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:59:29] RECOVERY - nutcracker process on mw1054 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:59:58] RECOVERY - SSH on mw1054 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:00:09] RECOVERY - DPKG on mw1054 is OK: All packages OK [21:00:19] RECOVERY - dhclient process on mw1054 is OK: PROCS OK: 0 processes with command name dhclient [21:00:28] RECOVERY - configured eth on mw1054 is OK - interfaces up [21:00:58] RECOVERY - HHVM rendering on mw1054 is OK: HTTP OK: HTTP/1.1 200 OK - 65702 bytes in 2.063 second response time [21:00:59] RECOVERY - puppet last run on mw1054 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [21:00:59] RECOVERY - RAID on mw1054 is OK no RAID installed [21:01:30] RECOVERY - nutcracker port on mw1054 is OK: TCP OK - 0.000 second response time on port 11212 [21:01:30] RECOVERY - salt-minion processes on mw1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:02:08] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [21:20:10] RECOVERY - puppet last run on mw1070 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:22:17] !log restarted cassandra on restbase1004 per urandom [21:22:21] Logged the message, Master [21:37:48] PROBLEM - puppet last run on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:08] PROBLEM - HHVM rendering on mw1093 is CRITICAL - Socket timeout after 10 seconds [21:38:36] (03PS2) 10Alex Monk: Set $wgMainStash to redis instead of the DB default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [21:39:08] PROBLEM - Apache HTTP on mw1093 is CRITICAL - Socket timeout after 10 seconds [21:39:29] PROBLEM - nutcracker process on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:40] PROBLEM - configured eth on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:08] (03PS3) 10Alex Monk: Set $wgMainStash to redis instead of the DB default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [21:40:19] PROBLEM - dhclient process on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:19] PROBLEM - SSH on mw1093 is CRITICAL - Socket timeout after 10 seconds [21:40:39] PROBLEM - DPKG on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:39] PROBLEM - salt-minion processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:49] PROBLEM - nutcracker port on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:58] PROBLEM - Disk space on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:09] PROBLEM - RAID on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:42:19] PROBLEM - HHVM processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:44:51] Do we use Dovecot anywhere these days? 
[21:46:19] RECOVERY - salt-minion processes on mw1093 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:46:19] RECOVERY - nutcracker port on mw1093 is OK: TCP OK - 0.000 second response time on port 11212 [21:46:28] RECOVERY - Disk space on mw1093 is OK: DISK OK [21:47:49] RECOVERY - HHVM processes on mw1093 is OK: PROCS OK: 6 processes with command name hhvm [21:51:49] PROBLEM - salt-minion processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:49] PROBLEM - nutcracker port on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:58] PROBLEM - Disk space on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:19] PROBLEM - HHVM processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:59:59] RECOVERY - nutcracker process on mw1093 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:03:39] PROBLEM - puppet last run on mw1021 is CRITICAL Puppet has 22 failures [22:03:49] RECOVERY - configured eth on mw1093 is OK - interfaces up [22:04:20] RECOVERY - dhclient process on mw1093 is OK: PROCS OK: 0 processes with command name dhclient [22:04:20] RECOVERY - HHVM processes on mw1093 is OK: PROCS OK: 6 processes with command name hhvm [22:04:28] RECOVERY - SSH on mw1093 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [22:04:48] RECOVERY - salt-minion processes on mw1093 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:04:48] RECOVERY - nutcracker port on mw1093 is OK: TCP OK - 0.000 second response time on port 11212 [22:04:48] RECOVERY - DPKG on mw1093 is OK: All packages OK [22:04:50] RECOVERY - Disk space on mw1093 is OK: DISK OK [22:05:18] RECOVERY - RAID on mw1093 is OK no RAID installed [22:05:29] RECOVERY - puppet last run on mw1093 is OK Puppet is currently enabled, last run 49 minutes ago with 0 failures [22:05:58] RECOVERY - HHVM rendering on mw1093 is OK: HTTP OK: HTTP/1.1 200 OK - 65694 bytes in 6.962 second response time [22:06:48] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [22:11:28] PROBLEM - HHVM rendering on mw1055 is CRITICAL - Socket timeout after 10 seconds [22:11:29] PROBLEM - Apache HTTP on mw1055 is CRITICAL - Socket timeout after 10 seconds [22:12:58] PROBLEM - puppet last run on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:13:29] PROBLEM - nutcracker port on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:08] PROBLEM - dhclient process on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - nutcracker process on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - salt-minion processes on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - DPKG on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - Disk space on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - RAID on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:20] PROBLEM - SSH on mw1055 is CRITICAL - Socket timeout after 10 seconds [22:14:39] PROBLEM - HHVM processes on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:40] PROBLEM - configured eth on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:20:19] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:26:19] RECOVERY - nutcracker port on mw1055 is OK: TCP OK - 0.000 second response time on port 11212 [22:27:19] RECOVERY - SSH on mw1055 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [22:27:29] RECOVERY - HHVM processes on mw1055 is OK: PROCS OK: 1 process with command name hhvm [22:27:38] RECOVERY - configured eth on mw1055 is OK - interfaces up [22:28:08] RECOVERY - HHVM rendering on mw1055 is OK: HTTP OK: HTTP/1.1 200 OK - 65694 bytes in 6.078 second response time [22:28:08] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.130 second response time [22:28:49] RECOVERY - dhclient process on mw1055 is OK: PROCS OK: 0 processes with command name dhclient [22:28:58] RECOVERY - salt-minion processes on mw1055 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:28:59] RECOVERY - nutcracker process on mw1055 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:29:00] RECOVERY - Disk space on mw1055 is OK: DISK OK [22:29:00] RECOVERY - RAID on mw1055 is OK no RAID installed [22:29:08] RECOVERY - DPKG on mw1055 is OK: All packages OK [22:31:19] RECOVERY - puppet last run on mw1055 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:33:30] PROBLEM - HHVM rendering on mw1023 is CRITICAL - Socket timeout after 10 seconds [22:34:09] PROBLEM - Apache HTTP on mw1023 is CRITICAL - Socket timeout after 10 seconds [22:35:08] PROBLEM - configured eth on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:19] PROBLEM - SSH on mw1023 is CRITICAL - Socket timeout after 10 seconds [22:35:20] PROBLEM - RAID on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:20] PROBLEM - salt-minion processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:20] PROBLEM - puppet last run on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:49] PROBLEM - DPKG on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:49] PROBLEM - nutcracker process on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:36:18] PROBLEM - Disk space on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:19] RECOVERY - salt-minion processes on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:38:49] PROBLEM - nutcracker port on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:39:49] PROBLEM - puppet last run on mw2148 is CRITICAL Puppet has 1 failures [22:42:58] PROBLEM - salt-minion processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:09] PROBLEM - HHVM processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:59] PROBLEM - dhclient process on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:28] RECOVERY - salt-minion processes on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:48:38] RECOVERY - dhclient process on mw1023 is OK: PROCS OK: 0 processes with command name dhclient [22:53:59] PROBLEM - salt-minion processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:54:09] PROBLEM - dhclient process on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:56:29] RECOVERY - puppet last run on mw2148 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:01:08] RECOVERY - nutcracker port on mw1023 is OK: TCP OK - 0.000 second response time on port 11212 [23:01:30] RECOVERY - salt-minion processes on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:01:30] RECOVERY - dhclient process on mw1023 is OK: PROCS OK: 0 processes with command name dhclient [23:01:49] RECOVERY - nutcracker process on mw1023 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [23:01:50] RECOVERY - DPKG on mw1023 is OK: All packages OK [23:02:19] RECOVERY - Disk space on mw1023 is OK: DISK OK [23:02:38] RECOVERY - HHVM processes on mw1023 is OK: PROCS OK: 6 processes with command name hhvm [23:02:47] (03PS1) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222859 (https://phabricator.wikimedia.org/T104735) [23:02:59] RECOVERY - configured eth on mw1023 is OK - interfaces up [23:02:59] (03CR) 10jenkins-bot: [V: 04-1] (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222859 (https://phabricator.wikimedia.org/T104735) (owner: 10John F. Lewis) [23:03:09] RECOVERY - SSH on mw1023 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:03:19] RECOVERY - RAID on mw1023 is OK no RAID installed [23:05:09] RECOVERY - puppet last run on mw1023 is OK Puppet is currently enabled, last run 51 minutes ago with 0 failures [23:05:39] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.121 second response time [23:06:49] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 65676 bytes in 0.104 second response time [23:06:58] (03Abandoned) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222859 (https://phabricator.wikimedia.org/T104735) (owner: 10John F. Lewis) [23:08:32] (03PS1) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222860 (https://phabricator.wikimedia.org/T104735) [23:15:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [23:17:39] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [23:19:28] PROBLEM - HHVM rendering on mw1021 is CRITICAL - Socket timeout after 10 seconds [23:19:29] PROBLEM - Apache HTTP on mw1021 is CRITICAL - Socket timeout after 10 seconds [23:19:59] PROBLEM - dhclient process on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:18] PROBLEM - nutcracker port on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:19] PROBLEM - DPKG on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:29] PROBLEM - HHVM processes on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:40] PROBLEM - configured eth on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:40] PROBLEM - SSH on mw1021 is CRITICAL - Socket timeout after 10 seconds [23:20:40] PROBLEM - nutcracker process on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:59] PROBLEM - RAID on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:21:00] PROBLEM - Disk space on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:21:08] PROBLEM - salt-minion processes on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:21:38] PROBLEM - puppet last run on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:28] RECOVERY - nutcracker port on mw1021 is OK: TCP OK - 0.000 second response time on port 11212 [23:27:50] RECOVERY - HHVM processes on mw1021 is OK: PROCS OK: 6 processes with command name hhvm [23:28:00] RECOVERY - nutcracker process on mw1021 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [23:33:09] PROBLEM - nutcracker port on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:19] PROBLEM - HHVM processes on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:38] PROBLEM - nutcracker process on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:34:29] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [23:34:30] RECOVERY - dhclient process on mw1021 is OK: PROCS OK: 0 processes with command name dhclient [23:34:48] RECOVERY - nutcracker port on mw1021 is OK: TCP OK - 0.000 second response time on port 11212 [23:34:59] RECOVERY - DPKG on mw1021 is OK: All packages OK [23:35:00] RECOVERY - HHVM processes on mw1021 is OK: PROCS OK: 6 processes with command name hhvm [23:35:10] RECOVERY - configured eth on mw1021 is OK - interfaces up [23:35:19] RECOVERY - SSH on mw1021 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:35:19] RECOVERY - nutcracker process on mw1021 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [23:35:38] RECOVERY - Disk space on mw1021 is OK: DISK OK [23:35:38] RECOVERY - salt-minion processes on mw1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:35:59] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 65677 bytes in 0.146 second response time [23:36:00] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [23:36:00] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 37 minutes ago with 0 failures [23:37:19] RECOVERY - RAID on mw1021 is OK no RAID installed [23:42:13] !log Ran "mwscript updateSpecialPages.php labswiki --override --only=Wantedpages" on silver because I wanted to see what was missing. We should make this automatic at some point, completed in 0.44 seconds [23:44:02] Is morebots asleep? :/ [23:44:06] !log test morebots [23:44:26] Logged the message, Master [23:45:05] !log Ran "mwscript updateSpecialPages.php labswiki --override --only=Wantedpages" on silver because I wanted to see what was missing. We should make this automatic at some point, completed in 0.44 seconds [23:45:23] maybe it's just... too long? [23:45:36] does it need to go in 140 characters? [23:49:37] !log Ran "mwscript updateSpecialPages.php labswiki --override --only=Wantedpages" on silver, completed in 0.44 seconds [23:49:41] Logged the message, Master [23:50:05] gj [23:50:57] twitter for admins [23:51:16] and surprise surprise, wantedpages reveals a bunch of outdated things