[00:00:45] (03PS1) 10Ori.livneh: Revert "Force 'Transfer-Encoding: Chunked' header on 404 responses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222726 [00:00:58] (03CR) 10Ori.livneh: [C: 032] Revert "Force 'Transfer-Encoding: Chunked' header on 404 responses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222726 (owner: 10Ori.livneh) [00:01:02] (03Merged) 10jenkins-bot: Revert "Force 'Transfer-Encoding: Chunked' header on 404 responses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222726 (owner: 10Ori.livneh) [00:15:51] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1426054 (10ori) For the past several weeks, this task was blocked on the fact that the HHVM renderer (mw1152) would cause 5xx spikes whenever it was pool... [00:16:06] (03PS21) 10Paladox: Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [00:25:44] 21 [00:28:09] I told him we're not doing that :< [00:29:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [00:30:11] ori, the codfw imagescalers are all hhvm+trusty already, right? [00:30:49] Krenair: I have no idea. Let's see. [00:30:59] it's just mw1153-1160 left? [00:31:28] yes, codfw are all hhvm+trusty [00:31:34] there's tin and terbium, too [00:31:37] right [00:31:40] they're another task [00:31:48] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 4 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1426057 (10Krenair) [00:32:30] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 4 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1095183 (10Krenair) (codfw imagescalers are all HHVM+Trusty already, so we're just left with mw1153-1160) [00:33:12] it should be pretty straightforward. IIRC the issue was/is that HHVM was slow at checking files for syntax errors, which is one of the things scap needs to do. But we can just continue using php5 for that. [00:34:18] we also need silver, and the snapshot hosts [00:34:34] * ori groans. [00:35:15] isn't there some command you can run that checks what php is installed on all boxes? 
:p [00:35:21] yeah [00:35:34] dsh on mediawiki-installation [00:35:36] or some salt magic [00:37:24] Ah [00:37:35] tmh [00:38:01] seems to be precise+php5 [00:38:11] sounds very likely [00:39:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:41:16] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1426063 (10Krenair) [00:41:35] * Krenair will file a ticket for that too [00:41:42] {'terbium.eqiad.wmnet': '/usr/bin/php5'} [00:41:42] {'silver.wikimedia.org': '/usr/bin/php5'} [00:41:42] {'tin.eqiad.wmnet': '/usr/bin/php5'} [00:41:44] {'mw1159.eqiad.wmnet': '/usr/bin/php5'} [00:41:46] {'mw1158.eqiad.wmnet': '/usr/bin/php5'} [00:41:48] {'mw1157.eqiad.wmnet': '/usr/bin/php5'} [00:42:43] snapshot* hosts are not in mediawiki-installation, but they are indeed using php5 [00:42:57] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426066 (10Krenair) 3NEW [00:43:19] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426075 (10Krenair) [00:43:22] Krenair: thanks, that's very useful [00:43:22] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1426074 (10Krenair) [00:43:40] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426066 (10Krenair) [00:43:43] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1426077 (10Krenair) [00:44:07] 6operations, 6Labs, 10wikitech.wikimedia.org, 7HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1426080 (10Krenair) [00:45:17] wasn't there already one for the tmh boxes [00:45:17] silver is such a mess [00:45:22] Or was it just mentioned on other tickets? [00:46:06] Reedy, I couldn't find one [00:46:26] Krenair, thanks again. That was some badly-needed and long-overdue bug triage. [00:47:22] So either I'm going blind or Phabricator's search is being unhelpful [00:47:27] Once this is all done, are we going to bump the minimum requirement to 5.4, or will we leapfrog 5.4 and go straight to 5.5? [00:48:15] I guess we need to survey most of the common distros [00:48:20] PHP are supporting 5.4 still at least [00:48:54] silver is running PHP 5.5.9 [00:48:58] So not technically a blocker for this [00:49:21] It's a blocker for any hhvm specific syntax in mw-config [00:49:28] But not for keeping 5.3 crap around in mw core [00:49:39] yeah [00:50:04] Krenair: phab search is useless. Try Google site:phabricator.wikimedia.org [00:50:31] needs a greasemonkey script [00:51:29] heh [00:52:33] Reedy, btw, you mentioned dsh and salt earlier: salt is an ops thing, right? [00:52:57] ops-only* [00:53:04] root-only, yes. [00:53:14] and dsh is... not? [00:53:18] nope [00:53:26] if you've got login access to the box... 
[00:53:31] dsh is just a wrapper around ssh [00:53:33] and can run your target command [00:53:43] I see [00:57:59] it's not as easy as it once was, Reedy [00:58:05] because agent-forwarding is disabled [00:58:12] heh, true [00:58:19] so ability to ssh to mw1041 + ability to ssh to tin != ability to ssh to mw1041 from tin [00:58:36] so you have to run it locally and it's slow? [00:58:38] but you can ssh as mwdeploy using the keyholder ssh agent socket, which is what scap does behind the scenes [00:58:58] i just hacked out the invocation, it's not exactly intuitive, so write it down :P [00:59:07] lol [00:59:12] SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -F 20 -g mediawiki-installation -r ssh -o -oUser=mwdeploy -- [command] [00:59:58] RECOVERY - haproxy failover on dbproxy1004 is OK check_failover servers up 2 down 0 [01:00:22] https://wikitech.wikimedia.org/wiki/Dsh is super out of date :( [01:00:26] !log reload haproxy dbproxy1004 [01:00:30] Logged the message, Master [01:00:42] zwinger, lol [01:01:02] wikitech is unusably slow and clunky [01:01:04] ohai springle [01:01:19] that does not bode well for keeping docs up-to-date [01:02:02] seems pretty quick to me? [01:02:55] Reedy: you a pilot yet? [01:03:14] springle: I've got a FAA PPL now.. So I guess, technically, yes? ;D [01:03:35] zwinger first appears in SAL archive 1, i.e. 11 years ago [01:04:18] http://i.imgur.com/pvzFgAk.png -- firstPaint at 3.7s [01:05:29] enwiki's main page, which is substantially bigger, is 2.1s [01:05:50] lol [01:06:10] it's not fast, but not that slow for me... on crappy hotel wifi [01:06:57] then there's the fact that the openstack integration is so atrocious and busted that the mere presence of these features in the sidebar in my visual field makes me want to close the tab as fast as possible [01:07:21] merge with mediawiki.org? ;) [01:07:32] * Krenair kicks legoktm [01:07:47] Krenair: sorry, I meant wikimediafoundation.org [01:07:58] no, just split up the openstack and documentation stuff [01:08:10] it's getting split up eventually anyway since ops are going for openstack horizon afaik [01:08:15] legoktm, haha [01:08:22] yep [01:08:39] also it should have a markdown contenthandler [01:09:02] I love VisualEditor (really) but it's not optimal for documenting shell scripts and configuration files [01:09:38] and (most controversially) I think pages that have not been edited in N months should be auto-deleted [01:10:05] outdated documentation is worse than no documentation at all, because it is actively misleading [01:10:21] and there is just so much of it around [01:10:38] anything that is still relevant gets edited at least once a year [01:10:45] i challenge you to find something useful that hasn't been [01:11:47] here's one to amuse springle: https://wikitech.wikimedia.org/wiki/Adding_a_file_in_innodb [01:12:11] heh [01:13:32] ori: my favorite page on wikitech is https://wikitech.wikimedia.org/wiki/Hurricanes [01:16:08] and we'd lose all these awesome pages if it was up to ori [01:16:39] ori, are dumps relevant? [01:17:00] I donno, I think ori may be right [01:17:15] I went digging for an example of no edits in a year, but still useful unique info [01:17:15] Krenair: to hhvm and such? 
[01:17:23] in general [01:17:33] I don't follow [01:17:38] anything that is still relevant gets edited at least once a year [01:17:38] i challenge you to find something useful that hasn't been [01:17:39] all the things I could find that hadn't been updated in a year were pretty bad, and actively misleading heh [01:18:11] we could just replace 3/4 of wikitech with "read the source, here's the links to our git repos" :) [01:18:15] well, deleting is never useful. automatically moving things to an Archive: namespace and excluding that from default search would be reasonable [01:18:23] except for hardware dc ops stuff, network stuff [01:18:48] legoktm: that's the wikipedian in you talking, and you should tell him to go back to enwiki or meta and leave your developer self alone :P [01:18:53] the BGP link on the front page ends up here: https://wikitech.wikimedia.org/w/index.php?title=BGP/old_setup&redirect=no [01:19:14] which talks about tampa, and us not having our own ASN and using a private one, etc [01:19:26] it's completely horribly the opposite of what it should say today heh [01:19:54] well, deleting is never useful. <-- i disagree! [01:20:38] https://wikitech.wikimedia.org/wiki/Proxy_access_to_cluster [01:20:42] ^ says to use fenari :) [01:21:12] and mentions labsconsole [01:21:30] at least you guys have been around long enough to be able to tell at a glance that this content is obsolete [01:21:39] the SSH config page has some issues with assuming everyone has root as well [01:21:53] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Reedy via Special:Random [01:22:00] imagine how disorienting it is to be new and have to navigate content that is so frequently totally incorrect [01:22:06] yeah [01:22:19] who has time to write docs, though :/ [01:22:34] I know, we should do it as a part of everything. it's sort of like writing tests. [01:22:54] what's the thing about lily.esams? [01:23:25] isn't the esams bastion hooft? [01:23:36] yes, it is [01:23:57] I don't know what lily is/was. fenari at least existed when I first arrived. I don't think I ever saw lily [01:24:17] apparently lily has previously hosted mailing lists [01:24:19] a markdown contenthandler should be pretty easy to do, now that pulling in external libraries to use in core is easy [01:24:26] (pre-jan12) [01:25:17] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1426112 (10Springle) Doing this to most production DBs seems straightforward. Pain points, due just to complexity, will be on M[1-4]. My expectation is that iptables won't be a performance bottleneck, but +1 to... [01:25:25] https://wikitech.wikimedia.org/wiki/Mail [01:25:35] why markdown? [01:26:08] because gerrit, phabricator and github [01:26:22] and mediawiki, as soon as we add the markdown content handler :P [01:26:29] recursive justification ftw [01:26:46] mchenry... sanger... lily [01:26:55] ori: anyways, https://wikitech.wikimedia.org/wiki/User:Legoktm/pywikibot_on_tools_lab hasn't been touched in over a year and is still accurate :) [01:27:19] ask Katie for a null-edit [01:27:42] gotta run, bye! [01:28:07] why do we have random pages like https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000698.eqiad.wmflabs? [01:28:41] Did esams used to be separate from the 'main cluster'? [01:29:20] depends what you mean by separate [01:30:03] https://wikitech.wikimedia.org/wiki/Proxy_access_to_cluster [01:30:45] esams needed a separate connection to the 'main cluster' (eqiad/pmtpa/sdtpa)?
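The keyholder-backed dsh invocation quoted at [00:59:12] can be wrapped into a small survey script to answer the earlier question about which PHP binary each app server uses, producing per-host output shaped like the salt results pasted at [00:41:42]. A minimal sketch follows, assuming a remote probe command and an added dsh -M flag (hostname prefix on each output line); neither is the exact check that was actually run.

    #!/usr/bin/env python
    # Sketch only: survey the PHP interpreter on every host in the
    # "mediawiki-installation" dsh group, reusing the keyholder agent socket
    # and dsh flags from the invocation quoted above. The remote probe is an
    # assumed command, not the exact check that was run over salt.
    import os
    import subprocess

    PROBE = "command -v php5 || command -v php"   # assumed probe command
    CMD = [
        "dsh", "-F", "20",                  # fan out, 20 hosts at a time
        "-g", "mediawiki-installation",     # same dsh group as in the log
        "-M",                               # prefix each output line with its hostname
        "-r", "ssh", "-o", "-oUser=mwdeploy",
        "--", PROBE,
    ]
    # the mwdeploy key lives behind the keyholder proxy, not a forwarded agent
    ENV = dict(os.environ, SSH_AUTH_SOCK="/run/keyholder/proxy.sock")

    output = subprocess.check_output(CMD, env=ENV).decode("utf-8", "replace")
    for line in output.splitlines():
        host, _, path = line.partition(":")
        print({host.strip(): path.strip()})   # e.g. {'tin.eqiad.wmnet': '/usr/bin/php5'}

This would be run from a deployment host such as tin, where the keyholder proxy socket exists; hosts outside the mediawiki-installation group (the snapshot hosts, per [00:42:43]) would still need to be checked separately.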
[01:39:28] PROBLEM - puppet last run on mw1069 is CRITICAL puppet fail [01:44:08] who has time to write docs, though :/ <---- Zanzu! [01:44:11] https://imgur.com/a/zanzu Zanzu! Documentation. Bug 1. [01:45:54] (I like to anthropomorphize/deify the random alphanumerical string that imgur gave that gallery. Zanzu is a capricious god. Zanzu is a howling storm. Zanzu will arrive next week.) [01:50:15] I wonder how yaseo/lopar fitted into https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [01:50:29] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:50:48] And I'm pretty sure that "We do not re-use hostnames of past servers on new servers." rule has been broken [01:53:00] Rules are made to be broken [01:53:53] https://wikitech.wikimedia.org/wiki/Capella [02:12:12] legoktm, sigh. a bunch of these old pmtpa pages were either left as is, or blanked [02:12:28] not properly deleted or marked as historical [02:12:33] or moved to obsolete namespace [02:18:19] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [02:29:02] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 09m 59s) [02:29:08] Logged the message, Master [02:31:28] PROBLEM - puppet last run on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:39] RECOVERY - HHVM rendering on mw1077 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.166 second response time [02:32:49] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [02:32:59] PROBLEM - HHVM rendering on mw1107 is CRITICAL - Socket timeout after 10 seconds [02:33:08] PROBLEM - HHVM rendering on mw1076 is CRITICAL - Socket timeout after 10 seconds [02:33:09] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:33:29] PROBLEM - Apache HTTP on mw1076 is CRITICAL - Socket timeout after 10 seconds [02:33:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [02:33:59] PROBLEM - dhclient process on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:59] PROBLEM - DPKG on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:00] PROBLEM - RAID on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:08] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [02:34:09] PROBLEM - puppet last run on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:39] RECOVERY - puppet last run on mw1077 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [02:34:48] PROBLEM - HHVM processes on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:48] PROBLEM - nutcracker process on mw1076 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:49] PROBLEM - salt-minion processes on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:50] PROBLEM - HHVM processes on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:50] PROBLEM - puppet last run on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:58] PROBLEM - HHVM rendering on mw1110 is CRITICAL - Socket timeout after 10 seconds [02:34:58] PROBLEM - DPKG on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:35:19] PROBLEM - RAID on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:19] PROBLEM - Apache HTTP on mw1110 is CRITICAL - Socket timeout after 10 seconds [02:35:19] PROBLEM - SSH on mw1110 is CRITICAL - Socket timeout after 10 seconds [02:35:39] RECOVERY - dhclient process on mw1076 is OK: PROCS OK: 0 processes with command name dhclient [02:35:39] RECOVERY - DPKG on mw1076 is OK: All packages OK [02:35:39] RECOVERY - RAID on mw1076 is OK no RAID installed [02:35:48] PROBLEM - RAID on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:50] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.125 second response time [02:35:50] RECOVERY - puppet last run on mw1076 is OK Puppet is currently enabled, last run 25 minutes ago with 0 failures [02:35:59] PROBLEM - Disk space on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:08] PROBLEM - nutcracker process on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:08] PROBLEM - dhclient process on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:09] PROBLEM - configured eth on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:28] RECOVERY - HHVM processes on mw1076 is OK: PROCS OK: 6 processes with command name hhvm [02:36:28] RECOVERY - nutcracker process on mw1076 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:36:38] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.315 second response time [02:36:39] RECOVERY - HHVM rendering on mw1076 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.585 second response time [02:36:49] PROBLEM - nutcracker port on mw1110 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:59] RECOVERY - RAID on mw1107 is OK no RAID installed [02:37:00] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.090 second response time [02:39:41] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-04 02:39:41+00:00 [02:39:46] Logged the message, Master [02:39:49] RECOVERY - Disk space on mw1110 is OK: DISK OK [02:39:49] RECOVERY - nutcracker process on mw1110 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:39:49] RECOVERY - dhclient process on mw1110 is OK: PROCS OK: 0 processes with command name dhclient [02:39:50] RECOVERY - configured eth on mw1110 is OK - interfaces up [02:40:19] RECOVERY - salt-minion processes on mw1110 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:40:19] RECOVERY - HHVM processes on mw1110 is OK: PROCS OK: 6 processes with command name hhvm [02:40:19] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 16 minutes ago with 0 failures [02:40:29] RECOVERY - DPKG on mw1110 is OK: All packages OK [02:40:38] RECOVERY - nutcracker port on mw1110 is OK: TCP OK - 0.000 second response time on port 11212 [02:40:49] RECOVERY - SSH on mw1110 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:41:19] RECOVERY - RAID on mw1110 is OK no RAID installed [02:45:58] PROBLEM - puppet last run on mw1110 is CRITICAL puppet fail [02:48:48] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [03:11:24] !log Promoted Krinkle and Krenair to admin, cloudadmin on wikitech, because duh. 
[03:11:30] Logged the message, Master [03:18:37] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1426182 (10Bawolff) @Mike_Peel To do development stuff, sometimes we have to have files of various types that could be dangerous. For example, svgs that take over your account and do... [03:26:58] PROBLEM - HHVM rendering on mw1040 is CRITICAL - Socket timeout after 10 seconds [03:27:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [03:27:39] PROBLEM - Apache HTTP on mw1040 is CRITICAL - Socket timeout after 10 seconds [03:28:39] PROBLEM - RAID on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:49] PROBLEM - SSH on mw1040 is CRITICAL - Socket timeout after 10 seconds [03:28:49] PROBLEM - puppet last run on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:59] PROBLEM - nutcracker port on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:29:19] PROBLEM - DPKG on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:29:48] PROBLEM - configured eth on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:30:08] PROBLEM - Disk space on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:30:59] PROBLEM - dhclient process on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:31:49] RECOVERY - Disk space on mw1040 is OK: DISK OK [03:32:09] RECOVERY - RAID on mw1040 is OK no RAID installed [03:32:19] RECOVERY - HHVM rendering on mw1040 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.233 second response time [03:32:19] RECOVERY - SSH on mw1040 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [03:32:19] RECOVERY - puppet last run on mw1040 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [03:32:30] RECOVERY - nutcracker port on mw1040 is OK: TCP OK - 0.000 second response time on port 11212 [03:32:48] RECOVERY - dhclient process on mw1040 is OK: PROCS OK: 0 processes with command name dhclient [03:32:58] RECOVERY - DPKG on mw1040 is OK: All packages OK [03:33:00] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.219 second response time [03:33:20] RECOVERY - configured eth on mw1040 is OK - interfaces up [04:38:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [04:52:09] PROBLEM - RAID on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:59] PROBLEM - HHVM rendering on mw1069 is CRITICAL - Socket timeout after 10 seconds [04:53:40] PROBLEM - Apache HTTP on mw1069 is CRITICAL - Socket timeout after 10 seconds [04:53:59] PROBLEM - puppet last run on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:38] PROBLEM - DPKG on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:58] PROBLEM - Disk space on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - configured eth on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - salt-minion processes on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - HHVM processes on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:54:59] PROBLEM - nutcracker process on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:59] PROBLEM - dhclient process on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:55:20] PROBLEM - SSH on mw1069 is CRITICAL - Socket timeout after 10 seconds [04:55:29] PROBLEM - nutcracker port on mw1069 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:56:39] RECOVERY - Disk space on mw1069 is OK: DISK OK [04:56:40] RECOVERY - salt-minion processes on mw1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:56:40] RECOVERY - dhclient process on mw1069 is OK: PROCS OK: 0 processes with command name dhclient [04:56:40] RECOVERY - nutcracker process on mw1069 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:56:40] RECOVERY - HHVM processes on mw1069 is OK: PROCS OK: 6 processes with command name hhvm [04:56:40] RECOVERY - configured eth on mw1069 is OK - interfaces up [04:57:09] RECOVERY - SSH on mw1069 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [04:57:10] RECOVERY - nutcracker port on mw1069 is OK: TCP OK - 0.000 second response time on port 11212 [04:57:39] RECOVERY - RAID on mw1069 is OK no RAID installed [04:57:39] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:58:09] RECOVERY - DPKG on mw1069 is OK: All packages OK [05:01:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jul 4 05:01:43 UTC 2015 (duration 1m 42s) [05:01:47] Logged the message, Master [06:08:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [06:27:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [06:30:59] PROBLEM - puppet last run on mw1217 is CRITICAL Puppet has 1 failures [06:31:09] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail [06:33:29] PROBLEM - puppet last run on db1067 is CRITICAL Puppet has 1 failures [06:34:59] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures [06:35:28] PROBLEM - puppet last run on db2040 is CRITICAL Puppet has 2 failures [06:36:18] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures [06:37:59] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [06:37:59] PROBLEM - puppet last run on mw1065 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:38:49] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:39:19] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:39:49] PROBLEM - puppet last run on mw1114 is CRITICAL Puppet has 1 failures [06:46:18] RECOVERY - puppet last run on mw1217 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:28] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on db1067 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on db2040 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:59] RECOVERY - 
puppet last run on mw2184 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:39] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:48] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:49] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:49] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:58] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:29] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:19] RECOVERY - puppet last run on mw1114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:39] PROBLEM - puppet last run on mw1106 is CRITICAL Puppet has 5 failures [07:06:49] RECOVERY - puppet last run on mw1106 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:21:19] PROBLEM - Apache HTTP on mw1113 is CRITICAL - Socket timeout after 10 seconds [07:21:39] PROBLEM - HHVM rendering on mw1113 is CRITICAL - Socket timeout after 10 seconds [07:21:39] PROBLEM - DPKG on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:08] PROBLEM - nutcracker port on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:09] PROBLEM - puppet last run on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:29] PROBLEM - RAID on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:39] PROBLEM - SSH on mw1113 is CRITICAL - Socket timeout after 10 seconds [07:22:49] PROBLEM - HHVM processes on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:50] PROBLEM - configured eth on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:58] PROBLEM - nutcracker process on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:19] PROBLEM - salt-minion processes on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:50] PROBLEM - Disk space on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:23:50] PROBLEM - dhclient process on mw1113 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:25:09] RECOVERY - salt-minion processes on mw1113 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:25:28] RECOVERY - DPKG on mw1113 is OK: All packages OK [07:25:38] RECOVERY - Disk space on mw1113 is OK: DISK OK [07:25:39] RECOVERY - dhclient process on mw1113 is OK: PROCS OK: 0 processes with command name dhclient [07:25:48] RECOVERY - nutcracker port on mw1113 is OK: TCP OK - 0.000 second response time on port 11212 [07:25:48] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 26 minutes ago with 0 failures [07:26:09] RECOVERY - RAID on mw1113 is OK no RAID installed [07:26:19] RECOVERY - SSH on mw1113 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [07:26:29] RECOVERY - HHVM processes on mw1113 is OK: PROCS OK: 6 processes with command name hhvm [07:26:30] RECOVERY - configured eth on mw1113 is OK - interfaces up [07:26:40] RECOVERY - nutcracker process on mw1113 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:31:59] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60579 bytes in 0.165 second response time [07:33:19] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 12 failures [07:56:48] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [08:00:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60558 bytes in 1.211 second response time [08:06:30] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:23:29] !log krinkle Synchronized php-1.26wmf12/resources/src/mediawiki/mediawiki.Title.js: I1dae1e63e47 (duration: 00m 17s) [08:49:19] PROBLEM - puppet last run on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:49:39] PROBLEM - dhclient process on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:49:58] PROBLEM - SSH on mw1045 is CRITICAL - Socket timeout after 10 seconds [08:50:08] PROBLEM - nutcracker process on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:28] PROBLEM - HHVM rendering on mw1045 is CRITICAL - Socket timeout after 10 seconds [08:50:49] PROBLEM - Apache HTTP on mw1045 is CRITICAL - Socket timeout after 10 seconds [08:51:58] PROBLEM - HHVM processes on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:58] PROBLEM - salt-minion processes on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:08] PROBLEM - configured eth on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:09] PROBLEM - nutcracker port on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:19] PROBLEM - RAID on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:38] PROBLEM - DPKG on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:08] PROBLEM - Disk space on mw1045 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:56:59] RECOVERY - dhclient process on mw1045 is OK: PROCS OK: 0 processes with command name dhclient [08:57:28] RECOVERY - nutcracker process on mw1045 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:57:38] RECOVERY - nutcracker port on mw1045 is OK: TCP OK - 0.000 second response time on port 11212 [09:01:18] RECOVERY - HHVM processes on mw1045 is OK: PROCS OK: 1 process with command name hhvm [09:01:18] RECOVERY - salt-minion processes on mw1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:01:19] RECOVERY - configured eth on mw1045 is OK - interfaces up [09:01:39] RECOVERY - RAID on mw1045 is OK no RAID installed [09:01:49] RECOVERY - DPKG on mw1045 is OK: All packages OK [09:01:58] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.351 second response time [09:02:19] RECOVERY - Disk space on mw1045 is OK: DISK OK [09:02:49] RECOVERY - SSH on mw1045 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [09:03:28] RECOVERY - HHVM rendering on mw1045 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.142 second response time [09:17:39] PROBLEM - puppet last run on mw1030 is CRITICAL Puppet has 5 failures [09:24:28] RECOVERY - puppet last run on mw1045 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:28:08] 7Puppet, 6operations, 10Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1426362 (10Nemo_bis) [09:39:34] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1426371 (10Krinkle) >>! In T99096#1354157, @dduvall wrote: >>>! In T99096#1337418, @BBlack wrote: >>... [09:41:10] (03CR) 10Krinkle: Set $wgMainStash to redis instead of the DB default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [10:23:49] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1426381 (10Paladox) [10:24:38] PROBLEM - puppet last run on nescio is CRITICAL puppet fail [10:43:19] RECOVERY - puppet last run on nescio is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:59] PROBLEM - Apache HTTP on mw1028 is CRITICAL - Socket timeout after 10 seconds [10:46:09] PROBLEM - RAID on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:29] PROBLEM - HHVM rendering on mw1028 is CRITICAL - Socket timeout after 10 seconds [10:47:38] PROBLEM - configured eth on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:39] PROBLEM - nutcracker port on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:58] PROBLEM - dhclient process on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:48:40] PROBLEM - SSH on mw1028 is CRITICAL - Socket timeout after 10 seconds [10:48:48] PROBLEM - nutcracker process on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:48:48] PROBLEM - HHVM processes on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:09] PROBLEM - salt-minion processes on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:09] PROBLEM - puppet last run on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:49:10] PROBLEM - DPKG on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:49] PROBLEM - HHVM rendering on mw1030 is CRITICAL - Socket timeout after 10 seconds [10:53:00] PROBLEM - Apache HTTP on mw1030 is CRITICAL - Socket timeout after 10 seconds [10:53:09] PROBLEM - Disk space on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:19] RECOVERY - dhclient process on mw1028 is OK: PROCS OK: 0 processes with command name dhclient [10:53:59] PROBLEM - RAID on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:09] PROBLEM - dhclient process on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:09] RECOVERY - SSH on mw1028 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:54:10] RECOVERY - nutcracker process on mw1028 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [10:54:10] RECOVERY - HHVM processes on mw1028 is OK: PROCS OK: 6 processes with command name hhvm [10:54:19] PROBLEM - SSH on mw1030 is CRITICAL - Socket timeout after 10 seconds [10:54:29] RECOVERY - salt-minion processes on mw1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:54:29] RECOVERY - puppet last run on mw1028 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:54:29] PROBLEM - configured eth on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:54:30] RECOVERY - DPKG on mw1028 is OK: All packages OK [10:54:49] RECOVERY - configured eth on mw1028 is OK - interfaces up [10:54:50] RECOVERY - Disk space on mw1028 is OK: DISK OK [10:54:58] RECOVERY - nutcracker port on mw1028 is OK: TCP OK - 0.000 second response time on port 11212 [10:55:10] RECOVERY - RAID on mw1028 is OK no RAID installed [10:55:59] RECOVERY - dhclient process on mw1030 is OK: PROCS OK: 0 processes with command name dhclient [10:56:09] PROBLEM - DPKG on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:40] PROBLEM - HHVM processes on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:40] PROBLEM - nutcracker process on mw1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:57:59] RECOVERY - DPKG on mw1030 is OK: All packages OK [10:57:59] RECOVERY - SSH on mw1030 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:58:08] RECOVERY - configured eth on mw1030 is OK - interfaces up [10:58:29] RECOVERY - HHVM processes on mw1030 is OK: PROCS OK: 6 processes with command name hhvm [10:58:29] RECOVERY - nutcracker process on mw1030 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [10:58:29] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [10:59:28] RECOVERY - RAID on mw1030 is OK no RAID installed [11:00:10] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.132 second response time [11:00:28] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [11:07:39] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [11:07:59] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:11:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:12:58] RECOVERY - puppet last run on mw1030 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:18:47] (03PS1) 10Merlijn van Deen: dynamicproxy: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [11:23:28] PROBLEM - puppet last run on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:24:38] PROBLEM - Apache HTTP on mw1070 is CRITICAL - Socket timeout after 10 seconds [11:24:38] PROBLEM - HHVM rendering on mw1070 is CRITICAL - Socket timeout after 10 seconds [11:25:09] PROBLEM - dhclient process on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:09] PROBLEM - nutcracker process on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:48] PROBLEM - salt-minion processes on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:49] PROBLEM - Disk space on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:49] PROBLEM - configured eth on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:49] PROBLEM - SSH on mw1070 is CRITICAL - Socket timeout after 10 seconds [11:26:08] PROBLEM - DPKG on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:19] PROBLEM - nutcracker port on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:19] PROBLEM - HHVM processes on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:26:29] PROBLEM - RAID on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:28:49] RECOVERY - dhclient process on mw1070 is OK: PROCS OK: 0 processes with command name dhclient [11:28:49] RECOVERY - nutcracker process on mw1070 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:31:08] RECOVERY - salt-minion processes on mw1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:31:09] RECOVERY - Disk space on mw1070 is OK: DISK OK [11:31:09] RECOVERY - configured eth on mw1070 is OK - interfaces up [11:31:18] RECOVERY - SSH on mw1070 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:31:28] RECOVERY - DPKG on mw1070 is OK: All packages OK [11:31:39] RECOVERY - nutcracker port on mw1070 is OK: TCP OK - 0.000 second response time on port 11212 [11:31:39] RECOVERY - HHVM processes on mw1070 is OK: PROCS OK: 6 processes with command name hhvm [11:31:49] RECOVERY - RAID on mw1070 is OK no RAID installed [11:39:49] RECOVERY - puppet last run on mw1070 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:46:38] PROBLEM - HHVM rendering on mw1106 is CRITICAL - Socket timeout after 10 seconds [11:48:08] PROBLEM - Apache HTTP on mw1106 is CRITICAL - Socket timeout after 10 seconds [11:48:29] PROBLEM - Disk space on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:49] PROBLEM - SSH on mw1106 is CRITICAL - Socket timeout after 10 seconds [11:48:50] PROBLEM - DPKG on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:08] PROBLEM - RAID on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:18] PROBLEM - salt-minion processes on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:18] PROBLEM - HHVM processes on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:19] PROBLEM - puppet last run on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:29] PROBLEM - nutcracker port on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:50] PROBLEM - dhclient process on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:50] PROBLEM - nutcracker process on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:58] PROBLEM - configured eth on mw1106 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:57:48] RECOVERY - Disk space on mw1106 is OK: DISK OK [11:58:09] RECOVERY - DPKG on mw1106 is OK: All packages OK [11:58:09] RECOVERY - SSH on mw1106 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [11:58:18] RECOVERY - RAID on mw1106 is OK no RAID installed [11:58:19] RECOVERY - HHVM processes on mw1106 is OK: PROCS OK: 6 processes with command name hhvm [11:58:19] RECOVERY - salt-minion processes on mw1106 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:58:29] RECOVERY - nutcracker port on mw1106 is OK: TCP OK - 0.000 second response time on port 11212 [11:58:59] RECOVERY - nutcracker process on mw1106 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:58:59] RECOVERY - dhclient process on mw1106 is OK: PROCS OK: 0 processes with command name dhclient [11:59:00] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.227 second response time [11:59:08] RECOVERY - configured eth on mw1106 is OK - interfaces up [11:59:29] RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 66012 bytes in 0.141 second response time [12:05:59] RECOVERY - puppet last run on mw1106 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:19] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds [12:19:09] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [12:31:38] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [12:39:29] PROBLEM - puppet last run on mw1081 is CRITICAL Puppet has 18 failures [12:40:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [12:44:18] PROBLEM - puppet last run on mw2099 is CRITICAL puppet fail [12:46:44] (03PS1) 10Nemo bis: [Planet Wikimedia] Add Josve05a to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/222761 [12:47:27] mutante: some nice freedom of panorama stuff, +2 pls :) https://gerrit.wikimedia.org/r/#/c/222761/ [12:49:59] PROBLEM - puppet last run on lvs2004 is CRITICAL puppet fail [12:51:09] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [12:51:09] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [12:51:19] (03PS2) 10Yuvipanda: planet: Add Josve05a to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/222761 (owner: 10Nemo bis) [12:51:27] cassandra on restbase1001 & restbase1005 are down, can someone restart? [12:51:27] (03CR) 10Yuvipanda: [C: 032 V: 032] planet: Add Josve05a to English Planet [puppet] - 10https://gerrit.wikimedia.org/r/222761 (owner: 10Nemo bis) [12:51:45] thanks! [12:52:38] urandom: yeah, doing now. service cassandra restart? [12:53:07] yeah [12:53:08] !log restarted cassandra on restbase1001 and 1005 [12:53:10] done [12:53:29] YuviPanda: thanks [12:53:34] yw [12:53:39] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [12:54:49] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [12:55:06] legoktm: > To use Connected Apps on this site, you must have an account across all projects. 
When you have an account on all projects, you can try to connect "SQL Quarry Local Test Instance" again. [12:55:08] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.002 second response time on port 9042 [12:55:09] I thought you had fixed that. [12:55:11] :( [12:56:15] !log restart nutcracker on silver [12:56:39] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.000 second response time on port 9042 [12:59:39] PROBLEM - puppet last run on mw2153 is CRITICAL Puppet has 1 failures [13:01:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [13:02:49] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:06:39] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:12:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:12:49] RECOVERY - puppet last run on mw1081 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:14:19] RECOVERY - puppet last run on mw2153 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:38:49] PROBLEM - HHVM rendering on mw1112 is CRITICAL - Socket timeout after 10 seconds [14:40:20] PROBLEM - Apache HTTP on mw1112 is CRITICAL - Socket timeout after 10 seconds [14:40:40] PROBLEM - HHVM processes on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:40:40] PROBLEM - Disk space on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:40] PROBLEM - nutcracker process on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:40] PROBLEM - configured eth on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:49] PROBLEM - salt-minion processes on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:49] PROBLEM - RAID on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:59] PROBLEM - dhclient process on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:08] PROBLEM - SSH on mw1112 is CRITICAL - Socket timeout after 10 seconds [14:42:10] PROBLEM - nutcracker port on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:18] PROBLEM - puppet last run on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:42:18] PROBLEM - DPKG on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:43:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [14:46:18] RECOVERY - Disk space on mw1112 is OK: DISK OK [14:46:20] RECOVERY - HHVM processes on mw1112 is OK: PROCS OK: 6 processes with command name hhvm [14:47:09] RECOVERY - nutcracker process on mw1112 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:47:09] RECOVERY - configured eth on mw1112 is OK - interfaces up [14:47:19] RECOVERY - salt-minion processes on mw1112 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:47:28] RECOVERY - RAID on mw1112 is OK no RAID installed [14:47:28] RECOVERY - dhclient process on mw1112 is OK: PROCS OK: 0 processes with command name dhclient [14:47:38] RECOVERY - SSH on mw1112 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:47:39] RECOVERY - nutcracker port on mw1112 is OK: TCP OK - 0.000 second response time on port 11212 [14:47:48] RECOVERY - puppet last run on mw1112 is OK Puppet is currently enabled, last run 30 minutes ago with 0 failures [14:47:48] RECOVERY - DPKG on mw1112 is OK: All packages OK [14:48:17] mw1112 got up on its own [14:49:11] the OOM killer killed hhvm [14:49:28] PROBLEM - HHVM rendering on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [14:49:39] PROBLEM - Apache HTTP on mw1062 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.353 second response time [14:49:51] PROBLEM - puppet last run on mw1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:49:58] PROBLEM - RAID on mw1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:51:18] RECOVERY - HHVM rendering on mw1062 is OK: HTTP OK: HTTP/1.1 200 OK - 65533 bytes in 0.377 second response time [14:51:29] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.206 second response time [14:51:30] RECOVERY - puppet last run on mw1062 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:51:40] RECOVERY - RAID on mw1062 is OK no RAID installed [14:56:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:04:47] 6operations, 7HHVM: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1426505 (10faidon) 3NEW [15:13:58] PROBLEM - Apache HTTP on mw1057 is CRITICAL - Socket timeout after 10 seconds [15:14:08] PROBLEM - HHVM rendering on mw1057 is CRITICAL - Socket timeout after 10 seconds [15:21:48] PROBLEM - puppet last run on mw1057 is CRITICAL Puppet has 24 failures [15:27:11] 6operations: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1426525 (10jcrespo) M[1-5] :-) And it is definitely a We! We could standardize on 4444, which is not assigned according to `/etc/services` and the one Galera uses for xtrabackup SSTs http://galeracluster.com/doc... [15:38:30] PROBLEM - puppet last run on mw1032 is CRITICAL Puppet has 68 failures [15:40:17] <_joe_> again almost OOMing ^^ [15:42:06] <_joe_> a lot of appservers are in that situation btw [15:47:14] 6operations, 7HHVM: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1426544 (10Joe) For reference, the problem doesn't seem to present itself on the API appservers. 
On mw1117, HHVM is running since june 25 but just using up 20% of the memory [15:53:19] RECOVERY - puppet last run on mw1032 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:03:09] PROBLEM - Apache HTTP on mw1083 is CRITICAL - Socket timeout after 10 seconds [16:03:18] PROBLEM - RAID on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:03:59] PROBLEM - HHVM rendering on mw1083 is CRITICAL - Socket timeout after 10 seconds [16:04:39] PROBLEM - puppet last run on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:08] PROBLEM - salt-minion processes on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:09] PROBLEM - nutcracker port on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:10] PROBLEM - dhclient process on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:18] PROBLEM - SSH on mw1083 is CRITICAL - Socket timeout after 10 seconds [16:05:38] PROBLEM - Disk space on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:49] PROBLEM - HHVM processes on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:19] PROBLEM - nutcracker process on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:19] PROBLEM - DPKG on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:06:39] PROBLEM - configured eth on mw1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:59] RECOVERY - Disk space on mw1083 is OK: DISK OK [16:11:09] RECOVERY - HHVM processes on mw1083 is OK: PROCS OK: 6 processes with command name hhvm [16:11:48] RECOVERY - nutcracker process on mw1083 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:12:16] !log krenair Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 10m 35s) [16:12:18] RECOVERY - salt-minion processes on mw1083 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:12:19] RECOVERY - dhclient process on mw1083 is OK: PROCS OK: 0 processes with command name dhclient [16:12:19] RECOVERY - nutcracker port on mw1083 is OK: TCP OK - 0.000 second response time on port 11212 [16:12:19] RECOVERY - RAID on mw1083 is OK no RAID installed [16:12:20] Logged the message, Master [16:12:29] RECOVERY - SSH on mw1083 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:12:59] RECOVERY - HHVM rendering on mw1083 is OK: HTTP OK: HTTP/1.1 200 OK - 65533 bytes in 0.226 second response time [16:13:38] RECOVERY - DPKG on mw1083 is OK: All packages OK [16:13:48] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 13 minutes ago with 0 failures [16:13:49] RECOVERY - configured eth on mw1083 is OK - interfaces up [16:14:00] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [16:40:51] (03PS2) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [16:40:56] YuviPanda: ^ [16:41:46] (03PS3) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [16:43:52] (03CR) 10Yuvipanda: dynamicproxy/tools: set up outage error system (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [16:43:56] 
valhallasw`cloud: whee in general. [16:46:13] OH GOD SO MANY COMMENTS [16:46:14] :D [16:49:26] (03CR) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [16:49:31] YuviPanda: nfs might be an issue, yeah [16:51:27] YuviPanda: I'm now actually worried puppet might not even run correctly when stuff is broken [16:51:46] there's lots of other stuff in puppet that's applied by default running anyway [16:51:54] there's a gridengine install on the proxies. [16:52:28] YuviPanda: yeah, I know. But if puppet doesn't run because NFS is down, we also can't apply changes to the nginx conf [16:52:42] right [16:52:42] doesn't run --> crashes/hangs [16:52:52] and the minute it hits an nfs file it's going to just hang. [16:53:01] yep [16:53:26] but if we restart the instance it'll come back up without NFS [16:53:29] and then puppet will run [16:53:29] it might be OK though, because if nfs is down, it'll just re-clone stuff on / [16:53:30] so it's not that bad [16:53:44] and we can work towards getting rid of NFS from tools instances that don't need them [16:54:03] yeah, setting up /data/project/admin shouldn't really be in proxy anyway [16:54:15] yeah [16:54:22] oh well [16:55:08] I can set the file{} to only copy if the file doesn't exist, I guess? [16:55:18] PROBLEM - HHVM rendering on mw1074 is CRITICAL - Socket timeout after 10 seconds [16:55:35] then it doesn't need to wait for the git clone... except if it randomly still waits for it because puppet is single-threaded [16:55:36] ugh [16:55:58] PROBLEM - Apache HTTP on mw1074 is CRITICAL - Socket timeout after 10 seconds [16:56:11] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1426579 (10BBlack) [16:56:55] YuviPanda: I'll change the code to insert the override in comments instead of not inserting it [16:56:59] PROBLEM - dhclient process on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:04] so that if puppet doesn't work it's still easy to apply [16:57:09] ok [16:57:18] PROBLEM - nutcracker process on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:18] PROBLEM - SSH on mw1074 is CRITICAL - Socket timeout after 10 seconds [16:57:19] PROBLEM - HHVM processes on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:29] PROBLEM - DPKG on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:57:49] PROBLEM - RAID on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:19] PROBLEM - puppet last run on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:58:49] RECOVERY - dhclient process on mw1074 is OK: PROCS OK: 0 processes with command name dhclient [16:58:59] RECOVERY - nutcracker process on mw1074 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:58:59] RECOVERY - SSH on mw1074 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [16:59:00] RECOVERY - HHVM processes on mw1074 is OK: PROCS OK: 6 processes with command name hhvm [16:59:18] RECOVERY - DPKG on mw1074 is OK: All packages OK [16:59:38] RECOVERY - RAID on mw1074 is OK no RAID installed [17:07:20] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [17:07:39] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [17:15:00] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [17:15:00] <_joe_> !log restarted cassandra on restbase1001 [17:15:05] Logged the message, Master [17:16:29] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.017 second response time on port 9042 [17:34:33] (03PS1) 10Alex Monk: wikitech: Clean up contentadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222776 [17:34:53] YuviPanda, ^ [17:35:30] (03PS4) 10Merlijn van Deen: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) [17:35:43] YuviPanda: ^ added notes in the config file. will write some docs as well [17:36:12] wait [17:36:14] ('nuke' allows you to delete all pages created by a user, as a convenience addition to 'delete', which is something Krinkle ran into yesterday) [17:39:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [17:43:25] YuviPanda: + https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin/emergency_guides/labs_down_notification [17:44:05] hmm, I dunno if the commenting is going to be that useful - if puppet isn't running you probably can't ssh in either, but still better to have [17:44:07] also yay :) [17:48:15] YuviPanda: yeah, not sure either, but puppet might also fail for other reasons (say, wikitech down) [17:48:23] yeah, fair enough [17:54:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:03:57] valhallasw`cloud: eugh, the quarry code in general seems kind of shit >_> [18:05:03] YuviPanda: :( [18:05:16] valhallasw`cloud: it isn't *too* bad, but it could be much better. [18:05:19] the organization is kind of shit. [18:06:37] meh, that's what you get with code that grows [18:08:15] valhallasw`cloud: probably, although this could've started better. [18:08:59] valhallasw`cloud: I should look into blueprints, etc and structure these a bit better [18:09:09] having the entire app be in one giant app.py file isn't that great either probably [18:10:35] hmm, the 'api' calls can probably be split out first [18:11:06] (03PS1) 10Alex Monk: wikitech: Get rid of unused ops namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 [18:12:49] (03CR) 10Alex Monk: "And just to be sure:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 (owner: 10Alex Monk) [18:14:53] valhallasw`cloud: do you think applications should be factored by functionality (query, user, login, etc) or by some other factor? (Api / login / html)? 
[18:14:56] hmm, the former makes more sense [18:15:27] except stuff like 'query' might become really big [18:15:30] (03CR) 10Glaisher: "https://wikitech.wikimedia.org/wiki/Ops:Templateeditor_right_for_enwiki Shouldn't something be done with this then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 (owner: 10Alex Monk) [18:17:15] (03CR) 10Alex Monk: "Yes, namespaceDupes would move it into the main namespace with an 'Ops:' pretend-namespace prefix." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222792 (owner: 10Alex Monk) [18:17:34] YuviPanda: you should have used django =p [18:17:44] heh [18:17:47] it might not be too late [18:17:54] (not really, but the structure might be useful for quarry as well) [18:18:00] :P [18:18:04] the admin bit might be yeah [18:19:03] so I'm thinking of splitting out login first, and then query (just deals with individual queries, running and displaying them) and then query-lists [18:19:15] yeah, that sounds sane [18:19:29] SQLAlchemy is kind of a bitch... [18:19:47] hm, it really is a bit of a CRUDdy app, in that sense [18:19:59] it's very very CRUDdy [18:20:01] well, no D [18:20:05] yeah [18:20:20] maybe you should reconsider django then [18:20:48] PROBLEM - HHVM rendering on mw1086 is CRITICAL - Socket timeout after 10 seconds [18:20:49] PROBLEM - Apache HTTP on mw1086 is CRITICAL - Socket timeout after 10 seconds [18:21:11] valhallasw`cloud: yeah, and I guess I can integrate django into celery easily [18:21:50] PROBLEM - nutcracker process on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:59] PROBLEM - Disk space on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:19] PROBLEM - puppet last run on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:49] PROBLEM - nutcracker port on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:57] valhallasw`cloud: although, I have to balance redoing this vs helping halfak with ORES [18:22:59] PROBLEM - salt-minion processes on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:59] PROBLEM - RAID on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:59] PROBLEM - configured eth on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:09] PROBLEM - Disk space on labstore2001 is CRITICAL: DISK CRITICAL - free space: /srv/backup-others-20150703 135031 MB (3% inode=98%) [18:23:10] PROBLEM - SSH on mw1086 is CRITICAL - Socket timeout after 10 seconds [18:23:10] PROBLEM - DPKG on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:19] PROBLEM - HHVM processes on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:23:39] PROBLEM - dhclient process on mw1086 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:25] Is that the same server? Or a different one this time? [18:25:29] Why do random mw servers keep doing this? [18:26:57] Krenair: it's because hhvm is too stable... [18:27:00] hm, apparently that host randomly caused sync to get stuck a few days ago, but otherwise 
nothing I could find in my logs for this channel [18:27:01] heh [18:27:20] quite literally - it's a memleak that makes hosts go OOM [18:27:29] fun [18:27:32] but because HHVM used to crash repeatedly it never became that bad before [18:27:39] unfortunately HHVM doesn't crash as much [18:29:59] RECOVERY - nutcracker port on mw1086 is OK: TCP OK - 0.000 second response time on port 11212 [18:30:10] RECOVERY - salt-minion processes on mw1086 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:30:10] RECOVERY - configured eth on mw1086 is OK - interfaces up [18:30:10] RECOVERY - RAID on mw1086 is OK no RAID installed [18:30:12] great... :| [18:30:19] RECOVERY - SSH on mw1086 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:30:19] RECOVERY - DPKG on mw1086 is OK: All packages OK [18:30:29] RECOVERY - HHVM processes on mw1086 is OK: PROCS OK: 6 processes with command name hhvm [18:30:48] RECOVERY - dhclient process on mw1086 is OK: PROCS OK: 0 processes with command name dhclient [18:30:49] RECOVERY - nutcracker process on mw1086 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:30:59] RECOVERY - Disk space on mw1086 is OK: DISK OK [18:33:29] PROBLEM - puppet last run on mw1084 is CRITICAL Puppet has 1 failures [18:37:12] valhallasw`cloud: I guess I could try by making a simple django mw oauth thing and see how I feel... [18:38:39] RECOVERY - puppet last run on mw1086 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:39:10] YuviPanda: yeah. hard to say what the best approach is [18:39:38] valhallasw`cloud: yeah. I also haven't written any Django in 5 years... [18:39:53] :D it's just like python except with layers and layers of indirection [18:40:05] valhallasw`cloud: we could add mw support to https://github.com/omab/python-social-auth! [18:40:59] PROBLEM - Apache HTTP on mw1058 is CRITICAL - Socket timeout after 10 seconds [18:41:09] PROBLEM - HHVM rendering on mw1058 is CRITICAL - Socket timeout after 10 seconds [18:41:40] YuviPanda: oooh. [18:42:39] PROBLEM - DPKG on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:48] PROBLEM - Disk space on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:50] PROBLEM - HHVM rendering on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:43:19] PROBLEM - nutcracker port on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:20] PROBLEM - configured eth on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:28] PROBLEM - RAID on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:29] PROBLEM - Apache HTTP on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:43:29] PROBLEM - puppet last run on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:39] PROBLEM - nutcracker port on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:39] PROBLEM - dhclient process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:48] PROBLEM - puppet last run on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:49] PROBLEM - dhclient process on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:49] PROBLEM - HHVM processes on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
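At 18:40 above, valhallasw`cloud suggests adding MediaWiki support to python-social-auth. A rough, unverified sketch of what such a backend might look like, following that library's BaseOAuth1 layout; the MediaWiki endpoint URLs, the identify/JWT handling, and the field mapping are assumptions to be checked against the Extension:OAuth docs, not code from either project.

from social.backends.oauth import BaseOAuth1


class MediaWikiOAuth(BaseOAuth1):
    """Hypothetical MediaWiki OAuth 1.0a backend (illustrative sketch only)."""
    name = 'mediawiki'

    # Assumed endpoints on meta.wikimedia.org; the real URLs should be taken
    # from the MediaWiki OAuth extension documentation.
    MEDIAWIKI_BASE = 'https://meta.wikimedia.org/w/index.php'
    REQUEST_TOKEN_URL = MEDIAWIKI_BASE + '?title=Special:OAuth/initiate'
    AUTHORIZATION_URL = MEDIAWIKI_BASE + '?title=Special:OAuth/authorize'
    ACCESS_TOKEN_URL = MEDIAWIKI_BASE + '?title=Special:OAuth/token'

    def user_data(self, access_token, *args, **kwargs):
        # A real backend would call Special:OAuth/identify here and verify the
        # signed JWT it returns; this placeholder returns nothing useful.
        return {}

    def get_user_details(self, response):
        # Map whatever the identify step returned onto the fields
        # python-social-auth expects.
        return {
            'username': response.get('username', ''),
            'email': response.get('email', ''),
        }

The backend would then be enabled by adding its dotted path to the project's AUTHENTICATION_BACKENDS / SOCIAL_AUTH settings, as with any other python-social-auth backend.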
[18:43:49] PROBLEM - SSH on mw1058 is CRITICAL - Socket timeout after 10 seconds [18:44:09] PROBLEM - nutcracker process on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:09] PROBLEM - salt-minion processes on mw1058 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:19] PROBLEM - salt-minion processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:19] PROBLEM - DPKG on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:39] PROBLEM - RAID on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:49] PROBLEM - configured eth on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:49] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:49] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:59] PROBLEM - Disk space on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:59] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:46:27] (03PS1) 10BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/222839 [18:48:08] YuviPanda: I'm going to poke Coren next week to take a look at the DNS stuff for the mailrelay, as he has more experience with spf etc [18:48:18] RECOVERY - Disk space on mw1058 is OK: DISK OK [18:48:23] +1, I don't even know what SPF stands for [18:48:28] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [18:48:28] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:48:28] RECOVERY - configured eth on mw1075 is OK - interfaces up [18:48:29] RECOVERY - Disk space on mw1075 is OK: DISK OK [18:48:29] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:48:59] RECOVERY - nutcracker port on mw1075 is OK: TCP OK - 0.000 second response time on port 11212 [18:49:08] RECOVERY - dhclient process on mw1075 is OK: PROCS OK: 0 processes with command name dhclient [18:49:18] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:49:23] (03Abandoned) 10Merlijn van Deen: [tools] Add user keys for Tools roots [puppet] - 10https://gerrit.wikimedia.org/r/222298 (owner: 10Merlijn van Deen) [18:49:38] RECOVERY - nutcracker process on mw1058 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:49:38] RECOVERY - salt-minion processes on mw1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:49:49] RECOVERY - salt-minion processes on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:49:51] RECOVERY - DPKG on mw1075 is OK: All packages OK [18:49:51] RECOVERY - DPKG on mw1058 is OK: All packages OK [18:50:29] RECOVERY - nutcracker port on mw1058 is OK: TCP OK - 0.000 second response time on port 11212 [18:50:29] RECOVERY - configured eth on mw1058 is OK - interfaces up [18:50:39] RECOVERY - RAID on mw1058 is OK no RAID installed [18:50:58] RECOVERY - HHVM processes on mw1058 is OK: PROCS OK: 6 processes with command name hhvm [18:50:59] RECOVERY - dhclient process on mw1058 is OK: PROCS OK: 0 processes with command name dhclient [18:50:59] RECOVERY - SSH on mw1058 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:51:49] RECOVERY - puppet last run on mw1084 is OK 
Puppet is currently enabled, last run 1 minute ago with 0 failures [18:53:34] valhallasw`cloud: I think I won't have time to move it completely to django, unfortunately. Also this could just as well be a learning experience :P I'm going to factor it into a bunch of flask blueprints [18:53:59] PROBLEM - configured eth on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:10] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [18:54:49] PROBLEM - puppet last run on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:19] PROBLEM - DPKG on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:49] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:49] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:55:59] PROBLEM - Disk space on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:56:29] PROBLEM - nutcracker port on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:09] PROBLEM - salt-minion processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:28] PROBLEM - dhclient process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:30] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [19:01:30] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:01:39] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:03:58] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [19:04:53] valhallasw`cloud: how does https://gerrit.wikimedia.org/r/222841 look? [19:05:28] PROBLEM - Cassanda CQL query interface on restbase1001 is CRITICAL: Connection refused [19:06:19] (03PS1) 10BBlack: tlsproxy: add negotiated cipher to conn props [puppet] - 10https://gerrit.wikimedia.org/r/222842 [19:06:59] PROBLEM - HHVM processes on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:59] PROBLEM - nutcracker process on mw1075 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
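Around 18:14–18:53 above, YuviPanda talks about factoring Quarry's single app.py into Flask blueprints (login, query, query-lists). A minimal sketch of what that split could look like; the blueprint names, routes, and the application factory here are illustrative assumptions, not Quarry's actual code.

from flask import Blueprint, Flask, jsonify

# "query" blueprint: deals with individual queries (running and displaying them).
query = Blueprint('query', __name__, url_prefix='/query')


@query.route('/<int:query_id>')
def show_query(query_id):
    # Placeholder view; the real one would render a template and hit the DB.
    return jsonify({'id': query_id})


# "login" blueprint: would hold the MediaWiki OAuth handshake views.
login = Blueprint('login', __name__)


@login.route('/login')
def start_login():
    return 'the MediaWiki OAuth redirect would go here'


def create_app():
    # Small application factory instead of one giant app.py: each functional
    # area lives in its own module and just gets registered here.
    app = Flask(__name__)
    app.register_blueprint(query)
    app.register_blueprint(login)
    return app


if __name__ == '__main__':
    create_app().run(debug=True)

Splitting by functionality (login, query, query-lists) rather than by transport (API vs. HTML) matches the approach discussed above: each blueprint can grow its own templates and API routes without app.py having to know about them.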
[19:07:09] PROBLEM - SSH on mw1075 is CRITICAL - Socket timeout after 10 seconds [19:07:39] RECOVERY - nutcracker port on mw1075 is OK: TCP OK - 0.000 second response time on port 11212 [19:07:39] RECOVERY - dhclient process on mw1075 is OK: PROCS OK: 0 processes with command name dhclient [19:07:48] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 47 minutes ago with 0 failures [19:08:09] RECOVERY - salt-minion processes on mw1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:08:10] RECOVERY - DPKG on mw1075 is OK: All packages OK [19:08:39] RECOVERY - RAID on mw1075 is OK no RAID installed [19:08:40] RECOVERY - configured eth on mw1075 is OK - interfaces up [19:08:48] RECOVERY - HHVM processes on mw1075 is OK: PROCS OK: 6 processes with command name hhvm [19:08:48] RECOVERY - nutcracker process on mw1075 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:08:49] RECOVERY - Disk space on mw1075 is OK: DISK OK [19:08:49] RECOVERY - SSH on mw1075 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:14:59] PROBLEM - puppet last run on mw1075 is CRITICAL puppet fail [19:15:02] YuviPanda, restbase1001 may need a restart it looks like [19:15:23] subbu: alright. cassandra or restbase? [19:15:47] i think cassandra? [19:15:49] cassandra it looks like [19:15:50] yeah [19:15:55] !log restarted cassandra on restbase1001 [19:16:00] Logged the message, Master [19:16:12] you still in UK or back in SF? [19:16:20] subbu: still in the UK :) [19:16:30] ah, wikimania plans? [19:16:39] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [19:16:51] subbu: yes! next weekend, I'm going to NYC, and then from there to Mexico, and then from there to SF :) [19:17:22] wow, ok .. so, sf -> lyon -> uk -> nyc -> mexico -> sf? [19:17:33] or did i miss a few stops in between? :) [19:18:09] RECOVERY - Cassanda CQL query interface on restbase1001 is OK: TCP OK - 0.001 second response time on port 9042 [19:18:21] subbu: no, it was OAK -> JFK -> Lyon -> Glasgow -> Birmingham (questionable?!) -> Manchester -> JFK -> Mexico -> OAK [19:18:43] ok :) [19:18:47] subbu: am currently planning next round of travel in August, for CCC Camp. [19:18:55] happens once in 4 years! and my GSoC student is going! :) [19:19:03] YuviPanda: oh? going to CCC Camp? [19:19:06] some day, .. i'll get there. [19:19:08] paravoid: yes! [19:19:18] paravoid: haven't booked tickets yet, but I am. [19:19:22] how come? [19:19:24] i've always wanted to check it out .. but your travel plans are making me dizzy. [19:19:48] paravoid: happens once every 4 years, and several people I was involved with in the himalayan hackerspace are coming too [19:20:16] ah [19:20:40] * paravoid throws https://wiki.debconf.org/wiki/DebConf15/CCCampTransport out there [19:20:45] paravoid: plus it's a good excuse to meet _joe_ for a bit [19:21:14] invalid certificate authority? [19:21:33] ignore that, SPI CA [19:21:40] paravoid: haha, paul proteus is going as well. [19:21:50] well of course :) [19:21:59] I considered doing CCC Camp too, but too many trips this year [19:22:03] it's like 3/4 separate social circles intersecting in weird ways (my gf / people in her circles are going to show up as well) [19:22:07] I'll probably go to congress in dec, though [19:22:35] paravoid: nice! 
this page is for the CCC -> DebConf15 bus [19:22:47] paravoid: so this also means I get only 20 days in SF after I go back this time... [19:22:56] just saying :P [19:22:59] :P [19:23:02] oh [19:23:06] I thought it was DebConf -> CCC [19:23:23] no, debconf is after CCC [19:23:26] right [19:23:32] I think Moritz is going to DebConf [19:23:36] yes, me too [19:23:39] ooooh, nice [19:23:48] why is Europe so cool yet stingy about letting me stay? brrrrr [19:24:04] but I have a 1 year visa now, MUST MAKE USE! [19:24:13] so coming to debconf then? [19:24:19] I'm super tempted [19:24:23] hahaha [19:24:42] at this point I'm basically like 'fuck you visas, I have you now, I shall go to all the conferences' [19:24:44] :P [19:25:25] paravoid: I might actually come, I'll talk with the others and see what they're up to afterwards [19:25:25] yeah I was reading http://lists.debconf.org/lurker/message/20150702.181154.2b5cffef.en.html and thinking of you [19:25:35] http://lists.debconf.org/lurker/message/20150703.025850.94e422ca.en.html etc. [19:25:50] paravoid: yeah [19:26:01] paravoid: however, I don't have to go through that! because if you apply from the *US*, they don't give a shit... [19:26:06] yeah yeah [19:26:15] I've probably ranted enough to bleed everyone's ears out [19:26:20] :P [19:26:31] you missed debconf's sponsorship deadline, though -- you'd have to pay... $30something/day for accommodation + food [19:26:46] that's ok. cheaper than living in SF... [19:26:48] if there is space [19:26:50] right. [19:26:57] and if nothing is happening post Camp... [19:27:21] it is really tempting, however. [19:27:42] :P [19:43:47] valhallasw`cloud: I'm going to go merge these anyway :P [19:49:59] YuviPanda: I'm watching a movie, sorry [19:50:07] valhallasw`cloud: yeah, 'tis ok :) [19:50:11] valhallasw`cloud: they ended up being fine. [19:50:17] I'll do more splits later. [19:50:21] valhallasw`cloud: enjoy your movie! [19:55:48] PROBLEM - Apache HTTP on mw1022 is CRITICAL - Socket timeout after 10 seconds [19:56:38] PROBLEM - HHVM rendering on mw1022 is CRITICAL - Socket timeout after 10 seconds [19:57:09] PROBLEM - puppet last run on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:18] PROBLEM - nutcracker port on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - salt-minion processes on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - configured eth on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - dhclient process on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:19] PROBLEM - SSH on mw1022 is CRITICAL - Socket timeout after 10 seconds [19:57:49] PROBLEM - DPKG on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:49] PROBLEM - Disk space on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:49] PROBLEM - RAID on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:59] PROBLEM - HHVM processes on mw1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:58:59] RECOVERY - nutcracker port on mw1022 is OK: TCP OK - 0.000 second response time on port 11212 [19:58:59] RECOVERY - salt-minion processes on mw1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:59:08] RECOVERY - configured eth on mw1022 is OK - interfaces up [19:59:08] RECOVERY - dhclient process on mw1022 is OK: PROCS OK: 0 processes with command name dhclient [19:59:08] RECOVERY - SSH on mw1022 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:59:28] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.417 second response time [19:59:29] RECOVERY - Disk space on mw1022 is OK: DISK OK [19:59:29] RECOVERY - RAID on mw1022 is OK no RAID installed [19:59:29] RECOVERY - DPKG on mw1022 is OK: All packages OK [20:00:19] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 65611 bytes in 0.184 second response time [20:00:49] RECOVERY - HHVM processes on mw1022 is OK: PROCS OK: 6 processes with command name hhvm [20:00:50] RECOVERY - puppet last run on mw1022 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [20:18:12] (03PS2) 10BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/222839 [20:18:14] (03PS1) 10BBlack: update "strong" desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/222851 [20:19:43] (03CR) 10BBlack: [C: 032] update "strong" desc for accuracy [puppet] - 10https://gerrit.wikimedia.org/r/222851 (owner: 10BBlack) [20:21:00] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1426759 (10BBlack) [20:38:32] (03PS3) 10BBlack: Move dhparam support from tlsproxy to sslcert/ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/222839 [20:50:29] PROBLEM - puppet last run on mw1070 is CRITICAL puppet fail [20:51:00] PROBLEM - Apache HTTP on mw1054 is CRITICAL - Socket timeout after 10 seconds [20:51:39] PROBLEM - HHVM rendering on mw1054 is CRITICAL - Socket timeout after 10 seconds [20:52:18] PROBLEM - nutcracker port on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:19] PROBLEM - salt-minion processes on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:38] PROBLEM - SSH on mw1054 is CRITICAL - Socket timeout after 10 seconds [20:52:50] PROBLEM - DPKG on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:09] PROBLEM - configured eth on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:38] PROBLEM - puppet last run on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:39] PROBLEM - RAID on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:58] PROBLEM - nutcracker process on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:58:29] PROBLEM - dhclient process on mw1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:59:29] RECOVERY - nutcracker process on mw1054 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:59:58] RECOVERY - SSH on mw1054 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:00:09] RECOVERY - DPKG on mw1054 is OK: All packages OK [21:00:19] RECOVERY - dhclient process on mw1054 is OK: PROCS OK: 0 processes with command name dhclient [21:00:28] RECOVERY - configured eth on mw1054 is OK - interfaces up [21:00:58] RECOVERY - HHVM rendering on mw1054 is OK: HTTP OK: HTTP/1.1 200 OK - 65702 bytes in 2.063 second response time [21:00:59] RECOVERY - puppet last run on mw1054 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [21:00:59] RECOVERY - RAID on mw1054 is OK no RAID installed [21:01:30] RECOVERY - nutcracker port on mw1054 is OK: TCP OK - 0.000 second response time on port 11212 [21:01:30] RECOVERY - salt-minion processes on mw1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:02:08] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [21:20:10] RECOVERY - puppet last run on mw1070 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:22:17] !log restarted cassandra on restbase1004 per urandom [21:22:21] Logged the message, Master [21:37:48] PROBLEM - puppet last run on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:38:08] PROBLEM - HHVM rendering on mw1093 is CRITICAL - Socket timeout after 10 seconds [21:38:36] (03PS2) 10Alex Monk: Set $wgMainStash to redis instead of the DB default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [21:39:08] PROBLEM - Apache HTTP on mw1093 is CRITICAL - Socket timeout after 10 seconds [21:39:29] PROBLEM - nutcracker process on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:40] PROBLEM - configured eth on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:08] (03PS3) 10Alex Monk: Set $wgMainStash to redis instead of the DB default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [21:40:19] PROBLEM - dhclient process on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:19] PROBLEM - SSH on mw1093 is CRITICAL - Socket timeout after 10 seconds [21:40:39] PROBLEM - DPKG on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:39] PROBLEM - salt-minion processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:49] PROBLEM - nutcracker port on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:58] PROBLEM - Disk space on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:41:09] PROBLEM - RAID on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:42:19] PROBLEM - HHVM processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:44:51] Do we use Dovecot anywhere these days? 
[21:46:19] RECOVERY - salt-minion processes on mw1093 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:46:19] RECOVERY - nutcracker port on mw1093 is OK: TCP OK - 0.000 second response time on port 11212 [21:46:28] RECOVERY - Disk space on mw1093 is OK: DISK OK [21:47:49] RECOVERY - HHVM processes on mw1093 is OK: PROCS OK: 6 processes with command name hhvm [21:51:49] PROBLEM - salt-minion processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:49] PROBLEM - nutcracker port on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:58] PROBLEM - Disk space on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:53:19] PROBLEM - HHVM processes on mw1093 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:59:59] RECOVERY - nutcracker process on mw1093 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:03:39] PROBLEM - puppet last run on mw1021 is CRITICAL Puppet has 22 failures [22:03:49] RECOVERY - configured eth on mw1093 is OK - interfaces up [22:04:20] RECOVERY - dhclient process on mw1093 is OK: PROCS OK: 0 processes with command name dhclient [22:04:20] RECOVERY - HHVM processes on mw1093 is OK: PROCS OK: 6 processes with command name hhvm [22:04:28] RECOVERY - SSH on mw1093 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [22:04:48] RECOVERY - salt-minion processes on mw1093 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:04:48] RECOVERY - nutcracker port on mw1093 is OK: TCP OK - 0.000 second response time on port 11212 [22:04:48] RECOVERY - DPKG on mw1093 is OK: All packages OK [22:04:50] RECOVERY - Disk space on mw1093 is OK: DISK OK [22:05:18] RECOVERY - RAID on mw1093 is OK no RAID installed [22:05:29] RECOVERY - puppet last run on mw1093 is OK Puppet is currently enabled, last run 49 minutes ago with 0 failures [22:05:58] RECOVERY - HHVM rendering on mw1093 is OK: HTTP OK: HTTP/1.1 200 OK - 65694 bytes in 6.962 second response time [22:06:48] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [22:11:28] PROBLEM - HHVM rendering on mw1055 is CRITICAL - Socket timeout after 10 seconds [22:11:29] PROBLEM - Apache HTTP on mw1055 is CRITICAL - Socket timeout after 10 seconds [22:12:58] PROBLEM - puppet last run on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:13:29] PROBLEM - nutcracker port on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:08] PROBLEM - dhclient process on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - nutcracker process on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - salt-minion processes on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - DPKG on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - Disk space on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:19] PROBLEM - RAID on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:20] PROBLEM - SSH on mw1055 is CRITICAL - Socket timeout after 10 seconds [22:14:39] PROBLEM - HHVM processes on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:14:40] PROBLEM - configured eth on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:20:19] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:26:19] RECOVERY - nutcracker port on mw1055 is OK: TCP OK - 0.000 second response time on port 11212 [22:27:19] RECOVERY - SSH on mw1055 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [22:27:29] RECOVERY - HHVM processes on mw1055 is OK: PROCS OK: 1 process with command name hhvm [22:27:38] RECOVERY - configured eth on mw1055 is OK - interfaces up [22:28:08] RECOVERY - HHVM rendering on mw1055 is OK: HTTP OK: HTTP/1.1 200 OK - 65694 bytes in 6.078 second response time [22:28:08] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.130 second response time [22:28:49] RECOVERY - dhclient process on mw1055 is OK: PROCS OK: 0 processes with command name dhclient [22:28:58] RECOVERY - salt-minion processes on mw1055 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:28:59] RECOVERY - nutcracker process on mw1055 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [22:29:00] RECOVERY - Disk space on mw1055 is OK: DISK OK [22:29:00] RECOVERY - RAID on mw1055 is OK no RAID installed [22:29:08] RECOVERY - DPKG on mw1055 is OK: All packages OK [22:31:19] RECOVERY - puppet last run on mw1055 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:33:30] PROBLEM - HHVM rendering on mw1023 is CRITICAL - Socket timeout after 10 seconds [22:34:09] PROBLEM - Apache HTTP on mw1023 is CRITICAL - Socket timeout after 10 seconds [22:35:08] PROBLEM - configured eth on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:19] PROBLEM - SSH on mw1023 is CRITICAL - Socket timeout after 10 seconds [22:35:20] PROBLEM - RAID on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:20] PROBLEM - salt-minion processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:20] PROBLEM - puppet last run on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:49] PROBLEM - DPKG on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:49] PROBLEM - nutcracker process on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:36:18] PROBLEM - Disk space on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:37:19] RECOVERY - salt-minion processes on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:38:49] PROBLEM - nutcracker port on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:39:49] PROBLEM - puppet last run on mw2148 is CRITICAL Puppet has 1 failures [22:42:58] PROBLEM - salt-minion processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:09] PROBLEM - HHVM processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:59] PROBLEM - dhclient process on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:28] RECOVERY - salt-minion processes on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:48:38] RECOVERY - dhclient process on mw1023 is OK: PROCS OK: 0 processes with command name dhclient [22:53:59] PROBLEM - salt-minion processes on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:54:09] PROBLEM - dhclient process on mw1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:56:29] RECOVERY - puppet last run on mw2148 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:01:08] RECOVERY - nutcracker port on mw1023 is OK: TCP OK - 0.000 second response time on port 11212 [23:01:30] RECOVERY - salt-minion processes on mw1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:01:30] RECOVERY - dhclient process on mw1023 is OK: PROCS OK: 0 processes with command name dhclient [23:01:49] RECOVERY - nutcracker process on mw1023 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [23:01:50] RECOVERY - DPKG on mw1023 is OK: All packages OK [23:02:19] RECOVERY - Disk space on mw1023 is OK: DISK OK [23:02:38] RECOVERY - HHVM processes on mw1023 is OK: PROCS OK: 6 processes with command name hhvm [23:02:47] (03PS1) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222859 (https://phabricator.wikimedia.org/T104735) [23:02:59] RECOVERY - configured eth on mw1023 is OK - interfaces up [23:02:59] (03CR) 10jenkins-bot: [V: 04-1] (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222859 (https://phabricator.wikimedia.org/T104735) (owner: 10John F. Lewis) [23:03:09] RECOVERY - SSH on mw1023 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:03:19] RECOVERY - RAID on mw1023 is OK no RAID installed [23:05:09] RECOVERY - puppet last run on mw1023 is OK Puppet is currently enabled, last run 51 minutes ago with 0 failures [23:05:39] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.121 second response time [23:06:49] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 65676 bytes in 0.104 second response time [23:06:58] (03Abandoned) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222859 (https://phabricator.wikimedia.org/T104735) (owner: 10John F. Lewis) [23:08:32] (03PS1) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222860 (https://phabricator.wikimedia.org/T104735) [23:15:59] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 4 below the confidence bounds [23:17:39] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [23:19:28] PROBLEM - HHVM rendering on mw1021 is CRITICAL - Socket timeout after 10 seconds [23:19:29] PROBLEM - Apache HTTP on mw1021 is CRITICAL - Socket timeout after 10 seconds [23:19:59] PROBLEM - dhclient process on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:18] PROBLEM - nutcracker port on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:19] PROBLEM - DPKG on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:29] PROBLEM - HHVM processes on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:40] PROBLEM - configured eth on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:40] PROBLEM - SSH on mw1021 is CRITICAL - Socket timeout after 10 seconds [23:20:40] PROBLEM - nutcracker process on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:59] PROBLEM - RAID on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:21:00] PROBLEM - Disk space on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:21:08] PROBLEM - salt-minion processes on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:21:38] PROBLEM - puppet last run on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:27:28] RECOVERY - nutcracker port on mw1021 is OK: TCP OK - 0.000 second response time on port 11212 [23:27:50] RECOVERY - HHVM processes on mw1021 is OK: PROCS OK: 6 processes with command name hhvm [23:28:00] RECOVERY - nutcracker process on mw1021 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [23:33:09] PROBLEM - nutcracker port on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:19] PROBLEM - HHVM processes on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:33:38] PROBLEM - nutcracker process on mw1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:34:29] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [23:34:30] RECOVERY - dhclient process on mw1021 is OK: PROCS OK: 0 processes with command name dhclient [23:34:48] RECOVERY - nutcracker port on mw1021 is OK: TCP OK - 0.000 second response time on port 11212 [23:34:59] RECOVERY - DPKG on mw1021 is OK: All packages OK [23:35:00] RECOVERY - HHVM processes on mw1021 is OK: PROCS OK: 6 processes with command name hhvm [23:35:10] RECOVERY - configured eth on mw1021 is OK - interfaces up [23:35:19] RECOVERY - SSH on mw1021 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [23:35:19] RECOVERY - nutcracker process on mw1021 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [23:35:38] RECOVERY - Disk space on mw1021 is OK: DISK OK [23:35:38] RECOVERY - salt-minion processes on mw1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:35:59] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 65677 bytes in 0.146 second response time [23:36:00] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [23:36:00] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 37 minutes ago with 0 failures [23:37:19] RECOVERY - RAID on mw1021 is OK no RAID installed [23:42:13] !log Ran "mwscript updateSpecialPages.php labswiki --override --only=Wantedpages" on silver because I wanted to see what was missing. We should make this automatic at some point, completed in 0.44 seconds [23:44:02] Is morebots asleep? :/ [23:44:06] !log test morebots [23:44:26] Logged the message, Master [23:45:05] !log Ran "mwscript updateSpecialPages.php labswiki --override --only=Wantedpages" on silver because I wanted to see what was missing. We should make this automatic at some point, completed in 0.44 seconds [23:45:23] maybe it's just... too long? [23:45:36] does it need to go in 140 characters? [23:49:37] !log Ran "mwscript updateSpecialPages.php labswiki --override --only=Wantedpages" on silver, completed in 0.44 seconds [23:49:41] Logged the message, Master [23:50:05] gj [23:50:57] twitter for admins [23:51:16] and surprise surprise, wantedpages reveals a bunch of outdated things