[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160128T0000).
[00:00:04] Krenair Jdlrobson matt_flaschen mobrovac: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:52] * mobrovac here
[00:00:58] how did we get to 10 patches?!
[00:01:11] mine's just a betacluster config patch, not touching prod
[00:01:52] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1972136 (10Dzahn)
[00:01:54] 6operations, 10ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1972137 (10Dzahn)
[00:02:02] And the Flow are the same one to two branches, not sure how that counts...
[00:02:30] 6operations, 10ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Dzahn)
[00:02:32] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1936607 (10Dzahn)
[00:02:56] 6operations, 10ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Dzahn)
[00:03:21] (03PS5) 10Alex Monk: Rename NS_PROJECT_TALK at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515)
[00:03:48] 6operations, 10ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1972156 (10Dzahn) a:5Dzahn>3Cmjohnson
[00:04:01] (03CR) 10Alex Monk: [C: 04-1] "Actually, don't we also need an alias for the old name?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515)
[00:04:05] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1972157 (10Dzahn) a:5Papaul>3Dzahn
[00:04:37] (03CR) 10Alex Monk: [C: 032] RESTBase: Start using deployment-restbase02 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266945 (https://phabricator.wikimedia.org/T125003) (owner: 10Mobrovac)
[00:04:59] Also, I'll do initial testing, but mooeypoo is going to monitor for Collaboration team afterwards to make sure nothing explodes. I'll be reachable by phone as well.
[00:05:04] cheers Krenair
[00:05:24] (03Merged) 10jenkins-bot: RESTBase: Start using deployment-restbase02 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266945 (https://phabricator.wikimedia.org/T125003) (owner: 10Mobrovac)
[00:06:07] I suppose mooeypoo could call
[00:06:22] but I am not going to start calling people if their swat breaks
[00:06:51] I'll send smoke signals...
[00:07:10] Seriously, though, if things go as bad as to break, I assume we can revert. We don't expect that will happen, though.
[00:07:16] indeed
[00:07:30] stuff that's non-trivial to revert should not be going out in swat
[00:07:36] (03PS1) 10Dzahn: remove public IP of nitrogen, decom [dns] - 10https://gerrit.wikimedia.org/r/266953 (https://phabricator.wikimedia.org/T123732)
[00:07:44] jdlrobson, around?
[00:08:39] Krenair: ++ "stuff that's non-trivial to revert should not be going out in swat"
[00:08:56] (03PS2) 10Dzahn: remove public IP of nitrogen, decom [dns] - 10https://gerrit.wikimedia.org/r/266953 (https://phabricator.wikimedia.org/T123732)
[00:09:14] Krenair: yes
[00:09:35] !log krenair@mira Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/266945/ (duration: 02m 36s)
[00:09:40] It is trivial to revert, it's unlikely to break (always possible though), and that's why mooeypoo is here.
[00:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:05] greg-g, Krenair: I'd instead just say 'stuff that's non-trivial to revert should not be going out, ever'. :-)
[00:10:13] "always possible" in general, no flags with this particular patch, just good practice to have someone monitor.
[00:10:24] matt_flaschen, the use of Message objects here is interesting, but it looks like it'll work
[00:10:29] James_F: that too, but, uh, see also: session manager :)
[00:10:40] greg-g: Well indeed. :-(
[00:11:09] James_F, if only that were always possible.
[00:11:11] (03CR) 10Dzahn: [C: 032] "server has already been shutdown and is gone from icinga" [dns] - 10https://gerrit.wikimedia.org/r/266953 (https://phabricator.wikimedia.org/T123732) (owner: 10Dzahn)
[00:11:25] Krenair, yeah, it should. It is working on Beta and locally: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ALog&type=delete&user=Mattflaschen&page=&year=&month=-1&tagfilter=&limit=1
[00:11:36] True.
[00:12:28] alright, I need to walk homeward, tgr you're still good on those patches post swat
[00:12:41] okay
[00:12:55] 6operations, 10Analytics-Cluster: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1972202 (10Dzahn) 3NEW
[00:12:55] matt_flaschen and jdlrobson's patches are going through jenkins, mobrovac's beta one is done
[00:13:02] Thanks Krenair
[00:13:09] speaking of which, is that beta change working?
[00:13:46] 6operations, 10Analytics-Cluster: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1972217 (10Dzahn)
[00:14:10] * mobrovac checking
[00:16:12] euh is VE disabled in beta???
[00:17:28] * mobrovac can't test
[00:18:46] 6operations, 7Swift: swift: puppetized mkfs fails on ms-be2003, ms-be2015 - https://phabricator.wikimedia.org/T125013#1972247 (10Dzahn) 3NEW
[00:19:02] 6operations, 7Swift: swift: puppetized mkfs/parted fails on ms-be2003, ms-be2015 - https://phabricator.wikimedia.org/T125013#1972254 (10Dzahn)
[00:19:31] James_F: ^ ? (VE in beta)
[00:21:00] 6operations, 10Analytics: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015#1972277 (10mforns) 3NEW
[00:21:03] it's listed on http://en.wikipedia.beta.wmflabs.org/wiki/Special:Version mobrovac
[00:21:36] Krenair: logged in, but don't see the magic VE button on http://en.wikipedia.beta.wmflabs.org/wiki/Lucius_Arruntius
[00:21:43] * mobrovac trying manully the veaction
[00:21:59] retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200)
[00:22:29] mutante: on sca100x?
[00:22:32] mobrovac, might be because of the single-tab thing. James_F?
[00:22:40] mobrovac: yes
[00:22:43] mobrovac, I see the edit button
[00:22:43] I.E. you just see Edit then can switch back and forth.
[00:22:45] and it opens VE
[00:22:50] as expected
[00:23:06] I do have the always use VE preference set though
[00:23:12] Krenair: adding ?veaction=edit did the trick
[00:23:22] Krenair: confirming it works
[00:23:30] Krenair: the patch
[00:23:32] :)
[00:23:44] mobrovac, you may want to read the description of https://phabricator.wikimedia.org/T58337 btw
[00:24:01] mutante: known, we need to remove graphoid from sca100x hosts
[00:24:05] mutante: can you ack?
[00:24:27] Krenair: wow, that a long list :) thnx!
[00:24:30] mobrovac: yes, should i link to something?
[00:24:43] mooeypoo, test case for MediaWiki.org: https://www.mediawiki.org/w/index.php?title=Special%3ALog&type=delete&user=mattflaschen-WMF&page=&year=&month=-1&tagfilter=&limit=1
[00:24:44] (03CR) 10Tim Landscheidt: "IMHO the comment describes the peculiarities of the cronrunner and thus would belong in that manifest, but it's also not explanatory for s" [puppet] - 10https://gerrit.wikimedia.org/r/266935 (owner: 10Tim Landscheidt)
[00:24:55] mobrovac, that's the pre-fix version.
[00:25:04] mutante: no task yet, will create one so that we don't forget it
[00:25:08] mobrovac, indeed, the VE loading logic complexity has grown a significant amount
[00:25:08] (03PS2) 10Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266935
[00:25:22] matt_flaschen, mooeypoo: wmf.10 deploying now
[00:25:31] mobrovac: great, thx
[00:25:51] I went and juggled a bunch of things while waiting for jenkins
[00:26:16] (not literally)
[00:26:22] Thanks. :)
[00:27:06] all right, jenkins just merged everything else too
[00:27:18] sync-masters is so slow :(
[00:27:23] !log krenair@mira Synchronized php-1.27.0-wmf.10/extensions/Flow/includes: https://gerrit.wikimedia.org/r/#/c/266938/ (duration: 02m 29s)
[00:27:28] matt_flaschen, mooeypoo: ^
[00:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:39] 6operations, 7Swift: swift: puppetized mkfs/parted fails on ms-be2003, ms-be2015 - https://phabricator.wikimedia.org/T125013#1972323 (10Dzahn) or just hardware error.. there is also a CRIT RAID on 2003 and DISK space on 2015
[00:28:56] 6operations, 7Swift: swift: puppetized mkfs/parted fails on ms-be2003, ms-be2015 / disk error - https://phabricator.wikimedia.org/T125013#1972327 (10Dzahn)
[00:30:15] Anyone have a work account on enwiki to do a sandbox test? I think they must have edited the original test case.
[00:30:23] I have a test on MediaWiki.org, though, above.
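The `!log ... Synchronized ...` entries above are what scap prints when a SWAT patch is pushed out. A minimal sketch of the deployer's side of one such sync, using the wmf.10 Flow change from the log; the patchset number in the refspec and the exact working steps are hypothetical:

```bash
# On the deployment host, from the staging tree: pull the cherry-pick
# down into the branch checkout, then sync it to the cluster.
cd /srv/mediawiki-staging/php-1.27.0-wmf.10/extensions/Flow
git fetch https://gerrit.wikimedia.org/r/mediawiki/extensions/Flow refs/changes/38/266938/1
git checkout FETCH_HEAD
cd /srv/mediawiki-staging
sync-dir php-1.27.0-wmf.10/extensions/Flow/includes 'https://gerrit.wikimedia.org/r/#/c/266938/'
```

The final `sync-dir` call is what produces the "Synchronized ... (duration: ...)" line logged to the Server Admin Log above.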
[00:31:11] 6operations, 10netops, 5Patch-For-Review: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1972337 (10Dzahn) removed IP from DNS, fixed link to blocking ticket, this is done, follow-up in the ops-eqiad ticket, assigned to Chris
[00:31:23] 6operations, 10ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1972339 (10Dzahn)
[00:31:25] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1972340 (10Dzahn)
[00:31:27] 6operations, 10netops, 5Patch-For-Review: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1972338 (10Dzahn) 5Open>3Resolved
[00:31:38] 6operations, 10Parsoid, 6Services: Switch Parsoid to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T125017#1972341 (10mobrovac) 3NEW
[00:31:53] 6operations, 10Parsoid, 6Services: Switch Parsoid to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T125017#1972353 (10mobrovac)
[00:31:56] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1972352 (10mobrovac)
[00:32:39] Just need someone to delete my test case at https://en.wikipedia.org/wiki/Topic:Sxafn7jgp80cvu10
[00:33:02] 6operations, 10netops: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1936607 (10Dzahn)
[00:33:21] mooeypoo, ^
[00:34:18] I don't think I can delete without being an admin...?
[00:34:50] I'm logging in, hang on
[00:34:58] I just deleted the topic
[00:35:02] oh! thank you
[00:35:14] matt_flaschen, seems to work
[00:35:22] Thanks, Reedy
[00:36:00] Looks good: https://en.wikipedia.org/w/index.php?title=Special:Log/delete&user=Reedy&limit=1
[00:36:04] (03PS1) 10Bmansurov: Enable QuickSurveys in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266955 (https://phabricator.wikimedia.org/T123771)
[00:36:12] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1972376 (10Dzahn) p:5Triage>3Normal
[00:37:12] Looks good Krenair ta
[00:37:24] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Finish conversion to Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1972377 (10GWicke)
[00:37:32] jdlrobson, I didn't deploy your change yet...
[00:37:45] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1527327 (10GWicke)
[00:37:57] Huhh wtf
[00:38:16] 6operations: reinstall/upgrade gerrit server (ytterbium) - https://phabricator.wikimedia.org/T125018#1972391 (10Dzahn) 3NEW
[00:38:19] ahh i was testing the wrong thing
[00:38:21] hah :)
[00:38:23] ignore
[00:38:56] matt_flaschen, mooeypoo, Reedy: doing wmf.11
[00:38:57] 6operations: reinstall/upgrade gerrit server (ytterbium) - https://phabricator.wikimedia.org/T125018#1972391 (10Dzahn)
[00:39:16] Thanks. Test case for that is at https://www.mediawiki.org/w/index.php?title=Special%3ALog&type=delete&user=mattflaschen-WMF&page=&year=&month=-1&tagfilter=&limit=1
[00:39:27] 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1972401 (10Dzahn)
[00:40:23] mutante: Wasn't there a request for new hardware for gerrit?
[00:40:38] Reedy: i dont know
[00:40:51] couldnt find anything linked to the tracking ticket
[00:41:06] i thought we already had them all, but no
[00:41:11] https://phabricator.wikimedia.org/T123132
[00:41:13] !log krenair@mira Synchronized php-1.27.0-wmf.11/extensions/Flow/: https://gerrit.wikimedia.org/r/#/c/266939/ (duration: 02m 27s)
[00:41:18] "Need spare server to upgrade/migrate gerrit"
[00:41:19] matt_flaschen, mooeypoo, Reedy : ^
[00:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:41:23] Reedy: thanks, linking that
[00:41:48] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#1972418 (10Dzahn)
[00:41:51] 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1972417 (10Dzahn)
[00:42:36] 6operations: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#1972424 (10Dzahn) 3NEW
[00:42:49] Thanks, Krenair. It worked: https://www.mediawiki.org/w/index.php?title=Special%3ALog&type=delete&user=mattflaschen-WMF&page=&year=&month=-1&tagfilter=&limit=1
[00:42:52] great
[00:42:55] 6operations: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#1972424 (10Dzahn)
[00:43:15] 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1972435 (10Reedy) ytterbium is nearly out of warranty. I suspect as it's a still a working machine, it would go into spare after a reinstall?
[00:43:24] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#1972439 (10Dzahn) p:5Normal>3High there is also T125018 to get the server OS away from precise and up to jessie
[00:43:59] jdlrobson, deploying
[00:44:26] (03PS1) 10Bmansurov: Enable QuickSurveys in eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770)
[00:45:59] (03PS2) 10Bmansurov: Enable QuickSurveys in eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770)
[00:46:19] !log krenair@mira Synchronized php-1.27.0-wmf.11/extensions/Gather/resources: https://gerrit.wikimedia.org/r/#/c/266793/ and https://gerrit.wikimedia.org/r/#/c/266792/ (duration: 02m 23s)
[00:46:24] jdlrobson, ^ please test
[00:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:46:29] on it!
[00:46:32] 6operations, 7Icinga: upgrade neon (icinga) - https://phabricator.wikimedia.org/T125023#1972468 (10Dzahn) 3NEW
[00:46:44] 6operations, 7Icinga: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#1972468 (10Dzahn)
[00:47:47] Krenair: looks good! thx
[00:48:00] I forgot about wmf10 but given that updates soon we'll survive
[00:48:03] 6operations: upgrade swift servers from precise to jessie - https://phabricator.wikimedia.org/T125024#1972477 (10Dzahn) 3NEW
[00:49:05] 6operations: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1972487 (10Dzahn) list of servers: mc1001.eqiad.wmnet: True mc1002.eqiad.wmnet: True mc1003.eqiad.wmnet: True mc1004.eqiad.wmnet: True mc1005.eqiad.wmnet: True mc1006.eqiad.wmnet: True mc1007.eqiad.wmnet: True...
[00:49:22] Krenair: unless you have time to swat those same changes into wmf10..?
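The Special:Log URLs being passed around above can also be checked non-interactively: the standard `action=query&list=logevents` API returns the same entries as structured data. A small sketch, mirroring the test case above:

```bash
# Show the most recent deletion log entry on mediawiki.org, with its
# parameters, to confirm the new-style entry landed after the sync.
curl -s 'https://www.mediawiki.org/w/api.php?action=query&list=logevents&letype=delete&lelimit=1&format=json' \
    | python -m json.tool
```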
[00:49:39] 6operations: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1972488 (10Dzahn)
[00:49:52] 6operations: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936600 (10Dzahn) 1005 as well: labsdb1005.eqiad.wmnet: True labsdb1006.eqiad.wmnet: True labsdb1007.eqiad.wmnet: True
[00:49:55] jdlrobson, not really, no
[00:50:00] np :)
[00:50:29] don't really have time for the stuff that I put up for swat
[00:50:36] because others came along and added more than the limit
[00:51:27] (03PS3) 10Alex Monk: Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor)
[00:51:42] (03CR) 10Alex Monk: [C: 032] Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor)
[00:51:51] 6operations: upgrade iron to jessie (or get rid of it) - https://phabricator.wikimedia.org/T125025#1972495 (10Dzahn) 3NEW
[00:52:17] 6operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#1711827 (10Dzahn) either we get rid of it or we have to upgrade it in T125025
[00:53:37] (03PS3) 10Alex Monk: Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor)
[00:53:47] (03CR) 10Alex Monk: [C: 032] Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor)
[00:53:57] 6operations, 10DBA: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#1972513 (10Dzahn) 3NEW
[00:55:28] 6operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#1972522 (10Dzahn) 3NEW
[00:55:53] (03Merged) 10jenkins-bot: Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor)
[00:56:37] (03Merged) 10jenkins-bot: Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor)
[00:57:55] 6operations: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1972533 (10Dzahn) blocked on and progress here: https://phabricator.wikimedia.org/tag/gitblit-deprecate/
[00:58:48] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264066/ (duration: 02m 26s)
[00:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160128T0100). Please do the needful.
[01:00:19] (03CR) 10Alex Monk: [C: 04-1] "Actually, don't we need to alias this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson)
[01:00:48] 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1972551 (10Dzahn) @Robh please see Reedy's question above.
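For namespace changes like the English Wiktionary aliases just synced (https://gerrit.wikimedia.org/r/264066), the live configuration can be confirmed through the siteinfo API rather than by eyeballing a page. A sketch along those lines:

```bash
# List the namespace aliases the wiki reports after the sync; the
# aliases added in the change above should appear in the output.
curl -s 'https://en.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=namespacealiases&format=json' \
    | python -m json.tool
```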
[01:01:57] 6operations, 7Blocked-on-RelEng: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1972554 (10Dzahn)
[01:02:42] finishing up with a last patch
[01:02:49] will defer the last to the next swat
[01:03:05] Krenair: thanks
[01:03:23] Krenair, sorry, I'll count the stuff deployed to both branches as two in the future, try to get it in earlier, and stop at 8 unless it's urgent (if it's urgent, I might have to request permission from greg per the docs).
[01:03:48] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264460/ (duration: 02m 30s)
[01:03:51] thank you matt_flaschen
[01:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:57] 6operations, 10Graphoid, 6Services: Remove Graphoid from sca100x - https://phabricator.wikimedia.org/T125029#1972563 (10mobrovac) 3NEW
[01:04:13] mutante: re grahpoid on sca100x - https://phabricator.wikimedia.org/T125029
[01:04:28] matt_flaschen: yep, urgent things are urgent and don't need a swat (but can use one if it's timely)
[01:05:55] mobrovac: :) ok
[01:07:37] ACKNOWLEDGEMENT - graphoid endpoints health on sca1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T125029
[01:07:37] ACKNOWLEDGEMENT - graphoid endpoints health on sca1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T125029
[01:09:20] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#1972584 (10RobH) a:3mark So with the upgrade of lead to 32GB, we now need @mark's approval to allocate lead as the gerrit server replacement. Once that is done, related T1250...
[01:10:35] 6operations, 6Release-Engineering-Team: reinstall/upgrade gerrit server (ytterbium) from precise to jessie - https://phabricator.wikimedia.org/T125018#1972595 (10RobH) This would invalidate once T123132 is approved, and I'd instead reclaim ytterbium to spares.
[01:50:36] tgr: did you get your stuff done?
[01:52:28] greg-g: no, I was waiting for the end of the Phabricator slot
[01:52:37] also, decided I should eat first
[01:53:43] (03PS2) 10Dzahn: admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul)
[01:54:41] (03CR) 10Dzahn: [C: 031] "yes, this makes sense and we talked about it. in addition to shell access on carbon (DHCPd) being able to read syslog is needed to debug" [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul)
[01:55:43] (03PS3) 10Dzahn: admin: allow dc-ops group to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul)
[01:56:18] tgr: gotcha, I don't think twentyafterfour is doing any phab deploy tonight
[01:57:02] huh
[01:57:04] I probably should have asked :)
[01:57:13] I'll just proceed then
[01:57:19] Indeed, I still haven't gotten signoff from the people who would be affected by the significant upstream changes, so phabricator updates are on hold
[01:58:09] (03CR) 10Dzahn: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH)
[01:59:23] (03CR) 10Dzahn: "sounds good, thanks for understanding" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder)
[02:00:19] yeah, the "HOLD" in the deploy windows usually means "if needed".... maybe I should add that :)
[02:01:09] (03CR) 10Dzahn: [C: 031] Tools: Fix double file resource for jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266934 (owner: 10Tim Landscheidt)
[02:04:55] Krenair, do you still need me on standby for our deployed patch? I assume all's good by now, but just making sure before I head out for the evening
[02:05:15] mooeypoo, nope, it was confirmed as fine
[02:05:27] all is good
[02:05:32] awesome. Have a good evening/night
[02:05:39] see you guys tomorrow
[02:05:59] (03CR) 10Dzahn: [C: 032] "both names worked, i assume somebody added it without deployment-prep as a temp. fix" [puppet] - 10https://gerrit.wikimedia.org/r/266539 (owner: 10Cscott)
[02:06:08] (03PS2) 10Dzahn: Add missing `.deployment-prep` to redis server hostname. [puppet] - 10https://gerrit.wikimedia.org/r/266539 (owner: 10Cscott)
[02:08:19] (03PS3) 10Dzahn: contint: stop cloning mediawiki/tools/codesniffer.git [puppet] - 10https://gerrit.wikimedia.org/r/260018 (https://phabricator.wikimedia.org/T66371) (owner: 10Hashar)
[02:08:30] (03CR) 10Dzahn: [C: 032] contint: stop cloning mediawiki/tools/codesniffer.git [puppet] - 10https://gerrit.wikimedia.org/r/260018 (https://phabricator.wikimedia.org/T66371) (owner: 10Hashar)
[02:12:00] (03CR) 10Dzahn: [C: 031] role::deployment::salt_masters: correct a hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/266216 (owner: 10Giuseppe Lavagetto)
[02:15:57] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[02:21:31] (03CR) 10Dzahn: [C: 031 V: 031] ganglia diskstat.py: pep8 fixes all over the place [puppet] - 10https://gerrit.wikimedia.org/r/264997 (owner: 10Chad)
[02:26:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[02:27:07] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 10m 21s)
[02:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:42] 6operations, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#1972753 (10Dzahn) a:5ArielGlenn>3RobH re-assigning back to robh
[02:30:35] (03CR) 10Dzahn: "Hello Halfak, i see that https://phabricator.wikimedia.org/T122666 has been closed. Does that mean this change is not needed anymore or is" [puppet] - 10https://gerrit.wikimedia.org/r/261642 (owner: 10Halfak)
[02:30:44] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail
[02:33:13] (03CR) 10Dzahn: "is the limn module still used today?" [puppet] - 10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis)
[02:34:23] (03CR) 10Dzahn: [C: 04-1] "@cscott are you still planning to amend this / respond to Yuvipanda's comment?" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott)
[02:36:10] (03CR) 10Dzahn: [C: 04-1] "if toolserver runs on labs, then this should be replaced with a change that deletes the unused config from the cluster setup" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis)
[02:36:40] (03CR) 10Dzahn: [C: 04-2] "the linked ticket is closed. https://phabricator.wikimedia.org/T113298 what's up ?" [puppet] - 10https://gerrit.wikimedia.org/r/240684 (https://phabricator.wikimedia.org/T113298) (owner: 10coren)
[02:37:59] andrewbogott, around?
[02:39:33] (03CR) 10Dzahn: "bump" [puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG)
[02:41:52] !log tgr@mira Synchronized php-1.27.0-wmf.11/includes/: deploy SessionManager patch for T124971: gerrit 266944, 266946 (duration: 03m 20s)
[02:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:43:56] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0]
[02:44:38] (03CR) 10Dzahn: [C: 031] "it's quite old and has a path conflict, but it still looks ok. was there a reason you have it open since then?" [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi)
[02:46:07] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 06m 16s)
[02:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:20] (03CR) 10Dzahn: [C: 031] "doesnt exist on masters, yep" [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris)
[02:47:27] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:54:25] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[02:55:26] (03PS1) 10Dzahn: racktables: fix top-scope variable without explicit namespace [puppet] - 10https://gerrit.wikimedia.org/r/266962
[02:58:43] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:58:59] (03PS1) 10Dzahn: monitoring: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266963
[03:00:21] (03PS1) 10Dzahn: ircyall: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266964
[03:01:35] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[03:02:32] (03PS1) 10Dzahn: ganglia: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266965
[03:03:41] (03PS1) 10Dzahn: dataset: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266966
[03:05:38] (03PS2) 10Dzahn: dataset: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266966
[03:06:56] (03PS1) 10Dzahn: mediawiki/jobrunner: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266967
[03:07:42] (03PS1) 10Mobrovac: RESTBase: Labs: Set correct IP for deployment-restbase01 [puppet] - 10https://gerrit.wikimedia.org/r/266968 (https://phabricator.wikimedia.org/T125003)
[03:08:54] (03PS1) 10Dzahn: ldap: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266969
[03:08:57] mutante: robh: could i get a review of a trivial rb labs config change - https://gerrit.wikimedia.org/r/#/c/266968/1 ?
[03:11:04] (03CR) 10Dzahn: [C: 032] "deployment-restbase01.deployment-prep.eqiad.wmflabs has address 10.68.16.128" [puppet] - 10https://gerrit.wikimedia.org/r/266968 (https://phabricator.wikimedia.org/T125003) (owner: 10Mobrovac)
[03:11:44] thnx mutante!
[03:11:49] np
[03:16:08] (03PS1) 10Dzahn: apache: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266970
[03:18:28] (03PS1) 10Dzahn: osm: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266971
[03:21:17] (03PS1) 10Dzahn: dynamicproxy: fix top-scope var without namespace, lint [puppet] - 10https://gerrit.wikimedia.org/r/266973
[03:24:29] (03PS1) 10Dzahn: cassandra: fix top-scope vars without namespaces [puppet] - 10https://gerrit.wikimedia.org/r/266975
[03:25:37] (03PS1) 10Dzahn: labs_bootstrapvs: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266976
[03:26:47] (03PS1) 10Dzahn: varnish: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266977
[03:28:15] (03PS1) 10Dzahn: grafana: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266978
[03:29:40] (03PS1) 10Dzahn: openstack: fix typo, "spandby-server" for glance [puppet] - 10https://gerrit.wikimedia.org/r/266979
[03:30:16] (03PS2) 10Dzahn: openstack: fix typo, "spandby-server" for glance [puppet] - 10https://gerrit.wikimedia.org/r/266979
[03:32:48] (03PS1) 10Dzahn: openstack: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266980
[03:34:33] (03PS1) 10Dzahn: salt: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266981
[03:37:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[03:37:05] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[03:37:32] (03PS1) 10Dzahn: labs: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266982
[03:41:34] (03PS1) 10Dzahn: lists: fix top-scope var, arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/266983
[03:41:35] 6operations, 10Beta-Cluster-Infrastructure, 6Labs: Duplicate IP address DNS entry - https://phabricator.wikimedia.org/T125040#1972870 (10mobrovac) 3NEW
[03:43:01] (03PS1) 10Dzahn: aptly: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266984
[03:46:16] (03PS1) 10Dzahn: ipsec: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266985
[03:54:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:54:34] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:01:39] 6operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#1973024 (10Dzahn) 3NEW
[04:02:27] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1973044 (10Dzahn)
[04:02:29] 6operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#1973045 (10Dzahn)
[05:09:44] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:18:04] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [24.0]
[05:35:55] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[05:39:05] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:03:47] Krenair: I’m back now, sorry I missed you
[06:04:40] (03CR) 10Andrew Bogott: [C: 031] "This is surely correct but I want to watch when it applies." [puppet] - 10https://gerrit.wikimedia.org/r/266979 (owner: 10Dzahn)
[06:19:13] (03PS1) 10Yuvipanda: Cleanup chroot after compressing it [docker-images/debian] - 10https://gerrit.wikimedia.org/r/266991
[06:19:16] (03PS1) 10Yuvipanda: Add steps to write a Dockerfile and build the image [docker-images/debian] - 10https://gerrit.wikimedia.org/r/266992
[06:22:45] (03PS2) 10Yuvipanda: Add steps to write a Dockerfile and build the image [docker-images/debian] - 10https://gerrit.wikimedia.org/r/266992
[06:30:14] (03PS3) 10Yuvipanda: Add steps to write a Dockerfile and build the image [docker-images/debian] - 10https://gerrit.wikimedia.org/r/266992
[06:30:25] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:30:33] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:35] (03CR) 10Yuvipanda: [C: 032 V: 032] Cleanup chroot after compressing it [docker-images/debian] - 10https://gerrit.wikimedia.org/r/266991 (owner: 10Yuvipanda)
[06:30:46] (03CR) 10Yuvipanda: [C: 032 V: 032] Add steps to write a Dockerfile and build the image [docker-images/debian] - 10https://gerrit.wikimedia.org/r/266992 (owner: 10Yuvipanda)
[06:31:43] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:14] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:44] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:44] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[06:43:33] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[06:49:58] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1973170 (10EBernhardson) @robh Would you be able to provide a ballpark estimate on per-server costs of WDQS nodes? We w...
[06:56:04] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:35] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:56:43] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:58:05] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:34] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:03:44] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1973186 (10Smalyshev) I am not sure what is required by RAID configuration, so I'll talk in terms of diskspace. Right n...
[07:04:07] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1973187 (10EBernhardson) I've completed load testing of the new codfw cluster, including testing some changes we are planning to make to CirrusSearch. Quick top-level stats as follows: Clu...
[07:13:10] (03PS1) 10ArielGlenn: remove palladium from list of salt masters on all minions [puppet] - 10https://gerrit.wikimedia.org/r/266993
[07:15:37] (03CR) 10ArielGlenn: [C: 032] remove palladium from list of salt masters on all minions [puppet] - 10https://gerrit.wikimedia.org/r/266993 (owner: 10ArielGlenn)
[07:16:09] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1973190 (10EBernhardson) @robh Would you be able to chime in here with a ballpark estimate of per server costs? Like WDQS this isn't for immediate purchase, this is to include in the budget...
[07:17:08] <_joe_> apergos: \o/
[07:17:16] yeah
[07:17:37] still a bit of follow-up but still
[07:18:47] is there a new salt master?
[07:18:55] yes, I'm about to send the email now
[07:19:06] it's been a secondary and the primary salt master for awhile
[07:19:19] cool, is the new one better?
[07:20:22] i guess i'll wait for the email :)
[07:20:39] email sent
[07:21:09] yes, it's actually reliable
[07:21:32] did my Monday (I think) email not go around?
[07:22:04] I see it in my inbox so presumably it did go
[07:22:28] now I get to wait awhile for puppet to run everywhere
[07:25:26] (03PS2) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273)
[07:30:49] (03PS1) 10EBernhardson: Return more like search queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266995
[07:35:29] (03PS3) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273)
[07:35:31] (03PS7) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273)
[07:35:33] (03PS6) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273)
[07:35:35] (03PS4) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671)
[07:37:31] <_joe_> ebernhardson: so the plan is to go active-active with elasticsearch?
[07:39:05] <_joe_> well, time for breakfast, ttyl :)
[07:41:39] 6operations, 10Citoid, 10Graphoid, 6Services: Remove Graphoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1973256 (10mobrovac)
[07:49:12] 6operations, 10Citoid, 10Graphoid, 10Mathoid, 6Services: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1973292 (10mobrovac)
[07:49:27] (03PS1) 10Mobrovac: SCA: Remove Citoid, Mathoid and Graphoid [puppet] - 10https://gerrit.wikimedia.org/r/266996 (https://phabricator.wikimedia.org/T125029)
[07:55:54] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[07:56:54] (03CR) 10Yurik: [C: 031] SCA: Remove Citoid, Mathoid and Graphoid [puppet] - 10https://gerrit.wikimedia.org/r/266996 (https://phabricator.wikimedia.org/T125029) (owner: 10Mobrovac)
[08:03:13] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[08:13:34] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[08:20:34] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[08:41:46] (03PS1) 10ArielGlenn: remove salt master role from palladium [puppet] - 10https://gerrit.wikimedia.org/r/267000
[08:43:17] (03CR) 10ArielGlenn: [C: 032] remove salt master role from palladium [puppet] - 10https://gerrit.wikimedia.org/r/267000 (owner: 10ArielGlenn)
[09:20:19] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1973432 (10ArielGlenn) Neodymium is now the only salt master, and just to be sure, I've removed the salt-master package from palladium as well as the role. The following hosts do not respond t...
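The cutover above (neodymium taking over from palladium as the sole salt master, T115287) ultimately comes down to the minions' `master` setting plus accepted keys. A rough sketch of verifying the switch, assuming stock Salt file locations; the hostnames come from the log, everything else is illustrative:

```bash
# On a minion: palladium should no longer appear in the master list
# (the setting may also live under /etc/salt/minion.d/).
grep -r '^master' /etc/salt/minion /etc/salt/minion.d/ 2>/dev/null

# On neodymium: every minion should still answer after the switch;
# hosts that stay silent end up on a follow-up list like the one in T115287.
salt '*' test.ping
```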
[09:21:05] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#1973433 (10Joe)
[09:25:13] 6operations, 10Analytics-Cluster: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1973447 (10ArielGlenn) 3NEW
[09:29:04] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1973461 (10ArielGlenn) 3NEW
[09:35:20] (03PS1) 10Faidon Liambotis: mirrors: switch Debian source mirror to ftp2 [puppet] - 10https://gerrit.wikimedia.org/r/267002
[09:36:09] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: switch Debian source mirror to ftp2 [puppet] - 10https://gerrit.wikimedia.org/r/267002 (owner: 10Faidon Liambotis)
[09:36:56] (03Abandoned) 10Giuseppe Lavagetto: wikidata: disable cronjobs temporarily [puppet] - 10https://gerrit.wikimedia.org/r/265721 (owner: 10Giuseppe Lavagetto)
[09:39:51] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare citoid for the codfw switchover - https://phabricator.wikimedia.org/T125057#1973482 (10Joe) 3NEW
[09:45:22] 6operations, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#1973500 (10Joe) 3NEW
[09:51:19] (03PS1) 10Yuvipanda: Add key for apt.wikimedia.org [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267004
[09:53:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[09:53:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[09:54:20] 6operations, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#1973519 (10Joe) 3NEW
[09:54:46] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267004 (owner: 10Yuvipanda)
[09:56:52] 6operations, 10netops: Peer with SFMIX at ULSFO in 200 Paul - https://phabricator.wikimedia.org/T124843#1973527 (10faidon) 5Open>3stalled We've already considered it and we have been in touch with a couple of SFMIX people. It's hard because it may not make financial sense for us (= it could be very expens...
[09:57:06] 6operations, 10netops: Peer with SFMIX at ULSFO in 200 Paul - https://phabricator.wikimedia.org/T124843#1973529 (10faidon) p:5Triage>3Low
[09:58:32] (03CR) 10Yuvipanda: [C: 032 V: 032] "Tested to work \o/" [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267004 (owner: 10Yuvipanda)
[10:00:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:01:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:02:21] 6operations, 10Analytics-Cluster: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1973531 (10elukey) Hello Ariel, I believe that @Ottomata is still working on the host: ``` # This node was previously a Hadoop Worker, but is now waiting # to be rep...
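The docker-images/debian patches above (chroot cleanup, Dockerfile build steps, the apt.wikimedia.org key) point at a debootstrap-based base-image build. A minimal sketch under those assumptions; the mirror URL matches WMF's public mirror, while the image name is hypothetical:

```bash
# Build a minimal jessie rootfs, point it at the WMF apt repo,
# then import the tree as a Docker base image.
sudo debootstrap --variant=minbase jessie ./chroot http://mirrors.wikimedia.org/debian
echo 'deb http://apt.wikimedia.org/wikimedia jessie-wikimedia main' \
    | sudo tee ./chroot/etc/apt/sources.list.d/wikimedia.list
sudo tar -C ./chroot -c . | docker import - wmf-debian:jessie   # image name is illustrative
```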
[10:22:48] 6operations, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#1973542 (10Joe) 3NEW [10:23:04] (03PS1) 10Mark Bergsma: Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 [10:24:14] (03CR) 10jenkins-bot: [V: 04-1] Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma) [10:27:53] 6operations, 6Language-Engineering, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#1973574 (10Joe) 3NEW [10:28:51] (03CR) 10Giuseppe Lavagetto: [C: 031] "Apart from the tests needing fixes, LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma) [10:31:38] (03CR) 10Ema: [C: 031] Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma) [10:32:20] (03PS1) 10Yuvipanda: Add jessie + jessie backports image builder [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267009 [10:34:36] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1973592 (10elukey) @Dzahn: I tried to logout from icinga but I didn't find how, so I used a Chrome incognito window :D elukey doesn't work but Elukey does work.. [10:35:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 788 [10:37:12] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1973596 (10MoritzMuehlenhoff) IIRC Icinga acts on the "cn" field from LDAP (and that's indeed capitalised in your case) [10:39:49] <_joe_> !log rolling reboot of jobrunners in eqiad [10:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 759715 Threads: 2 Questions: 5444308 Slow queries: 5077 Opens: 2063 Flush tables: 2 Open tables: 412 Queries per second avg: 7.166 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:41:01] (03PS2) 10Yuvipanda: Add jessie + jessie backports image builder [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267009 [10:42:37] !log rebooted parsoid systems in codfw for kernel update, rolling reboot for eqiad [10:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:49:41] (03PS2) 10Mark Bergsma: Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 [10:53:14] PROBLEM - Host mw1006 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:43] RECOVERY - Host mw1006 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [10:55:23] PROBLEM - Host wtp1005 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:34] RECOVERY - Host wtp1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [10:57:38] wtp is me, my icinga downtime was too short [10:59:05] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [11:13:14] (03CR) 10Mark Bergsma: [C: 031] Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma) [11:17:14] PROBLEM - Host mw1010 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:43] RECOVERY - Host mw1010 is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [11:25:58] (03CR) 10Mark Bergsma: "Good start, but perhaps we should do some more work first on preparing PyBal for 
multiple ipvs control implementations in separate commits" (034 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/261375 (owner: 10Giuseppe Lavagetto) [11:26:13] PROBLEM - Host mw1166 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:13] RECOVERY - Host mw1166 is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [11:27:49] (03CR) 10Ema: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma) [11:28:18] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#1973660 (10Joe) 3NEW [11:30:03] PROBLEM - Host mw1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:30:34] RECOVERY - Host mw1012 is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [11:31:25] !log disabled puppet on analytics1027 due to some issues with camus and hdfs [11:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:23] PROBLEM - Host mw1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:34] RECOVERY - Host mw1002 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [11:42:03] PROBLEM - Host mw1015 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:44] RECOVERY - Host mw1015 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [11:54:03] PROBLEM - Host mw1007 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:21] (03CR) 10Mark Bergsma: [C: 04-1] "I agree with Brandon here." [debs/pybal] - 10https://gerrit.wikimedia.org/r/233043 (owner: 10Ori.livneh) [11:54:43] RECOVERY - Host mw1007 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [11:57:34] PROBLEM - Host mw1165 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:18] (03PS2) 10Jcrespo: New parsercache servers for codfw datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265501 (https://phabricator.wikimedia.org/T121879) [11:58:53] RECOVERY - Host mw1165 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [11:59:54] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 2.12 ms [12:01:41] !log powercycled mw1163, was unreachable after reboot of the jobrunners (but now up again after powercycle via mgmt) [12:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:02:45] (03CR) 10Daniel Kinzler: Use custom generator for mobile search on Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [12:05:55] PROBLEM - Host mw1164 is DOWN: PING CRITICAL - Packet loss = 100% [12:06:58] (03CR) 10Jcrespo: [C: 032] New parsercache servers for codfw datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265501 (https://phabricator.wikimedia.org/T121879) (owner: 10Jcrespo) [12:07:13] RECOVERY - Host mw1164 is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [12:07:29] !log pooling new parsercaches for codfw datacenter [12:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:12] !log jynus@mira Synchronized wmf-config/db-eqiad.php: New parsercache servers for codfw datacenter (duration: 02m 15s) [12:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:43] (and now the real thing) [12:14:42] !log jynus@mira Synchronized wmf-config/db-codfw.php: New parsercache servers for codfw datacenter (duration: 03m 10s) [12:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:17:33] PROBLEM - Host mw1003 
is DOWN: PING CRITICAL - Packet loss = 100% [12:18:44] RECOVERY - Host mw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [12:19:00] The "Error connecting to 10.192.32.128: Unknown database 'parsercache'" is me, codfw-only [12:22:06] the "Table 'parsercache.pc253' doesn't exist", me too [12:23:55] !log generating empty schema for new codfw parsercaches [12:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:32] well, it "works" now, I do not know if it works works [12:38:05] 6operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 2 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1973722 (10jcrespo) @Joe, it is a configuration change. Should be "trivial". I opened this ticket because it has more repercussion... [12:48:16] 6operations, 10ops-codfw, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1973749 (10jcrespo) [12:51:40] Starting the reboot of all the Kafka nodes for the new kernel upgrades (kafka* hosts) [12:53:02] !log stopping kafka on kafka1012 + host reboot for kernel upgrade [12:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:05] 6operations, 10ops-codfw, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1973754 (10jcrespo) 5Open>3Resolved I've implemented the service on codfw, this was trivial there as I am not importing parsercache keys yet. I... [12:59:35] !log rebooting rutherfordium (peopleweb) for kernel update [12:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:43] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:01:44] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:54] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:01:54] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:01:54] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:54] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:54] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:54] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:01:55] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:01:55] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:01:56] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:01:56] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:57] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:57] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:01:58] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:01:58] PROBLEM - IPsec on cp4012 is CRITICAL: 
Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:14] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:14] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:14] PROBLEM - IPsec on cp3019 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:14] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:14] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:14] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:15] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:15] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:15] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:16] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:16] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:23] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:23] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:24] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:24] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:24] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:24] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:24] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:25] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:25] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:26] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:37] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:37] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:38] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:02:38] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [13:02:39] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:53] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:02:53] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:02:53] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:03:03] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:03] PROBLEM - 
IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:03:04] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:03:04] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:03:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [13:03:05] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:13] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:03:13] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:13] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:13] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [13:03:14] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:14] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:14] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [13:03:15] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:15] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [13:03:50] ---^ I silenced icinga before restarting O_O [13:04:04] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:04:13] The only thing that worries me here is the high text [13:04:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [13:04:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [13:05:55] I think it is finished [13:06:54] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:10:34] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:11:23] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:11:24] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [13:11:24] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [10.0] [13:11:34] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:11:34] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [13:11:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:13:13] ---^ checking the underreplicated partitions now [13:14:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:14:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 76.00% of data above the critical threshold [10.0]
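The Kafka "Under Replicated Partitions" criticals firing around this point are the expected aftermath of rebooting a broker: any partition whose in-sync replica set (ISR) is smaller than its configured replica set counts as under-replicated until the restarted broker catches back up. A minimal sketch of the kind of check being run here, assuming Kafka's stock CLI tools are on PATH; the ZooKeeper connect string below is a made-up placeholder, not the real cluster setting:

```python
# Sketch: list partitions whose ISR has shrunk after a broker restart.
# kafka-topics.sh and --under-replicated-partitions are stock Kafka
# tooling; the ZooKeeper address/chroot is hypothetical.
import subprocess

out = subprocess.check_output([
    "kafka-topics.sh", "--describe", "--under-replicated-partitions",
    "--zookeeper", "zk1001.example.wmnet:2181/kafka",  # hypothetical
])
# Empty output means every partition's ISR matches its replica set again,
# i.e. the rebooted broker has fully caught up.
print(out.decode() or "no under-replicated partitions")
```

Once the rebooted broker rejoins and replays, these checks recover on their own.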
[13:14:14] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 78.26% of data above the critical threshold [10.0] [13:14:40] yeah 12:57 -> 13:00 there's a real 5xx spike on text, but already gone I *think* [13:15:56] elukey: there's no effective way to silence those ipsec alerts presently [13:16:12] if one node dies, all of its ipsec peers are going to alert on ipsec loss [13:16:22] and the kafka nodes talk to basically all of the non-eqiad caches [13:16:32] (over ipsec) [13:16:44] ahhh got it [13:17:33] taking a break [13:22:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [13:23:26] (03CR) 10Phuedx: [C: 04-2] "Blocked on I61431c2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [13:25:21] (03CR) 10Phuedx: "Please also break this change into two: the first for enabling the feature on beta, and the second for enabling the feature in production." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [13:29:11] joal: lots of sudo errors from your account [13:30:03] an1027 [13:30:10] (03PS3) 10BBlack: cxserver, citoid -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) [13:30:23] yes paravoid, please excuse me --> we are experiencing an issue with camus (kafka consumer from hadoop), and I made mistakes like launching a bash under hdfs user, then sudoing again [13:31:03] (03CR) 10BBlack: [C: 032] cxserver, citoid -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [13:31:16] !log citoid and cxserver public hostnames moving to cache_text [13:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:54] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:54] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:55] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:03] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:03] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:03] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:03] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:03] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:04] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:04] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:04] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:04] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:05] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:13] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:13] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:14] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [13:37:14] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:14] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:14] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:14] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:15] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:15] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:16] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:16] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:17] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:17] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:23] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:23] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:23] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:23] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:23] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:24] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:24] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:29] yikes [13:37:35] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:43] PROBLEM - Apache HTTP on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:43] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:43] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:44] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:44] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:44] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:44] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:44] PROBLEM - HHVM rendering on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:45] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:45] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:53] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:53] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:53] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:53] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:53] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:54] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:54] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] 
PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:55] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:56] PROBLEM - HHVM rendering on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:07] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:07] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:08] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:08] PROBLEM - HHVM rendering on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:09] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:09] PROBLEM - HHVM rendering on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:10] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:34] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.103 second response time [13:38:34] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.307 second response time [13:38:35] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.131 second response time [13:38:35] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.286 second response time [13:38:42] Search is broken. [13:38:43] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.431 second response time [13:38:43] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 3.407 second response time [13:38:43] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.221 second response time [13:38:44] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 3.688 second response time [13:38:44] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.143 second response time [13:38:45] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.696 second response time [13:38:45] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 2.028 second response time [13:38:45] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.627 second response time [13:38:53] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.161 second response time [13:38:54] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.122 second response time [13:38:55] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [13:38:55] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.076 second response time [13:38:55] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 2.158 second response time [13:38:55] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 2.179 second response time [13:38:56] RECOVERY - Apache HTTP on 
mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.218 second response time [13:39:03] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 5.506 second response time [13:39:03] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.061 second response time [13:39:04] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:04] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [13:39:04] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [13:39:04] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.617 second response time [13:39:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:39:22] looks like hhvm issues? [13:39:23] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:33] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:33] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [13:39:33] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [13:39:33] PROBLEM - restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:34] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:34] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:34] PROBLEM - restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:34] PROBLEM - restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:44] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - api_80 - Could not depool server mw1201.eqiad.wmnet because of too many down! [13:39:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - api_80 - Could not depool server mw1201.eqiad.wmnet because of too many down! 
[13:39:45] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - api_80 - Could not depool server mw1201.eqiad.wmnet because of too many down! [13:39:54] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:54] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:54] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:04] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:04] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:04] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:04] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:04] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:05] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:05] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:05] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:05] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:06] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:06] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:07] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:23] bblack: It's affecting search on Wikidata, that's the only thing I know. 
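The PyBal "Could not depool server ... because of too many down!" criticals above are a capacity safeguard rather than a separate failure: when too many backends in a pool fail their health checks at once, the load balancer refuses to depool any more of them instead of shrinking the pool below a minimum serving capacity. A rough illustration of that logic follows; this is not PyBal's actual code, and the threshold value is invented:

```python
# Illustrative depool-threshold logic (not PyBal's real implementation).
# A failing backend is only removed from rotation if enough of the pool
# would remain in service afterwards.

def can_depool(healthy_pooled: int, total: int, threshold: float = 0.5) -> bool:
    """True if one more backend may safely be taken out of the pool."""
    return healthy_pooled - 1 >= total * threshold

print(can_depool(healthy_pooled=45, total=50))  # True: capacity to spare
print(can_depool(healthy_pooled=24, total=50))  # False: "too many down!"
```

The trade-off is deliberate: serving some requests from unhealthy backends beats depooling your way down to an empty pool.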
[13:40:23] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:40:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:40:43] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.988 second response time [13:40:43] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 4.503 second response time [13:40:43] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.469 second response time [13:40:44] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.128 second response time [13:40:44] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.090 second response time [13:40:45] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.971 second response time [13:40:45] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.081 second response time [13:40:53] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.947 second response time [13:40:53] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.582 second response time [13:40:54] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.640 second response time [13:40:54] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.846 second response time [13:40:54] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.473 second response time [13:40:54] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.192 second response time [13:40:55] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.253 second response time [13:40:55] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.899 second response time [13:40:55] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.957 second response time [13:41:03] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 5.508 second response time [13:41:03] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.691 second response time [13:41:03] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.671 second response time [13:41:03] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.560 second response time [13:41:03] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [13:41:03] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [13:41:04] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.503 second response time [13:41:07] (03CR) 10Bmansurov: "Why is it necessary to break the patch into two?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [13:41:13] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 65469 bytes in 0.195 second response time [13:41:13] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.588 second response time [13:41:13] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [13:41:14] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [13:41:24] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.081 second response time [13:41:34] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.202 second response time [13:42:04] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 8824 bytes in 0.083 second response time [13:42:05] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:05] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:05] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:05] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:05] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 6.419 second response time [13:42:13] PROBLEM - restbase endpoints health on xenon is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:42:13] PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:13] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:14] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:14] PROBLEM - Apache HTTP on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.302 second response time [13:42:14] PROBLEM - HHVM rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:14] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:23] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:23] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:23] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:23] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:24] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:24] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:25] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:25] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:25] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:43] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:54] 
RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.871 second response time [13:43:04] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [13:43:04] RECOVERY - restbase endpoints health on restbase1005 is OK: All endpoints are healthy [13:43:04] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 1.250 second response time [13:43:05] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [13:43:13] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.397 second response time [13:43:13] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [13:43:15] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [13:43:23] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 65464 bytes in 4.145 second response time [13:43:24] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [13:43:24] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.548 second response time [13:43:33] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.190 second response time [13:43:33] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.316 second response time [13:43:34] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.297 second response time [13:43:34] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.930 second response time [13:43:34] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:43:44] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.161 second response time [13:43:44] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.182 second response time [13:43:45] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [13:43:45] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.196 second response time [13:43:45] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.429 second response time [13:43:45] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.601 second response time [13:43:45] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.183 second response time [13:43:45] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.509 second response time [13:43:45] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 65472 bytes in 3.984 second response time [13:43:53] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.445 second response time [13:43:53] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 65464 bytes in 4.544 second response time [13:43:53] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 65471 bytes in 2.133 second response time [13:43:53] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 440 bytes in 3.365 second response time [13:43:53] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [13:44:04] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 65464 bytes in 1.576 second response time [13:44:04] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [13:44:04] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [13:44:23] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:23] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:24] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:24] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:24] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.359 second response time [13:44:33] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:33] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:34] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:34] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:43] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:43] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:44] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:44] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:53] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:44:54] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.333 second response time [13:45:03] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:04] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 9.574 second response time [13:45:04] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.951 second response time [13:45:13] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 5.148 second response time [13:45:13] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.577 second response time [13:45:14] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:14] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.118 second response time [13:45:14] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.218 second response time [13:45:15] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.836 second response time [13:45:43] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.039 second response time [13:45:44] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:54] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [13:46:04] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:14] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:14] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:15] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:23] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.428 second response time [13:46:23] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.547 second response time [13:46:24] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:24] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.737 second response time [13:46:24] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:25] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.690 second response time [13:46:25] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:25] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.631 second response time [13:46:25] It's remarkably silent in here [13:46:33] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:33] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:33] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.042 second response time [13:46:34] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:38] hoo: we're here [13:46:39] investigating [13:46:43] hoo: the site is remarkably mostly-up [13:46:43] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:43] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.159 second response time [13:46:43] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 3.611 second response time [13:46:43] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:43] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:46:44] PROBLEM - salt-minion processes on mw1120 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:46:44] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:44] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:46] ok, good to hear [13:46:49] we're still not sure the scope of the problem [13:46:51] the API is affected, the main pool is not [13:46:53] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:53] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [13:46:54] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:54] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 7.323 second response time [13:46:54] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.762 second response time [13:46:55] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:03] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 65464 bytes in 8.193 second response time [13:47:04] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.985 second response time [13:47:04] PROBLEM - HHVM rendering on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:04] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:13] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 6.658 second response time [13:47:13] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:14] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.764 second response time [13:47:23] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:25] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 3.279 second response time [13:47:33] PROBLEM - HHVM rendering on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:33] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:33] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:34] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:34] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.990 second response time [13:47:34] PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:34] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:43] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:43] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.779 second response time [13:47:45] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.480 second response time [13:47:53] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:08] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:13] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [13:48:13] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:13] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:13] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:13] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:13] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:25] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:33] api.svc.eqiad.wmnet paged [13:48:34] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.675 second response time [13:48:34] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:35] <_joe_> yes [13:48:44] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.845 second response time [13:48:44] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:44] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:44] PROBLEM - restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:45] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.236 second response time [13:48:45] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:54] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 65471 bytes in 1.208 second response time [13:48:54] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.706 second response time [13:48:54] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.759 second response time [13:48:54] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.121 second response time [13:48:54] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:54] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.409 second response time [13:48:55] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:55] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.498 second response time [13:48:55] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.716 second response time [13:48:56] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.479 second response time [13:48:56] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.228 second response time [13:48:57] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.599 second response time [13:49:13] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:14] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:14] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:14] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
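The scope remark above ("the API is affected, the main pool is not") and the api.svc.eqiad.wmnet page fit together: the MediaWiki application servers are split across separate LVS service pools, so the API pool can be saturated while ordinary page views keep being served. A quick way to probe the two independently is sketched below; api.svc.eqiad.wmnet is taken from the alerts, while appservers.svc.eqiad.wmnet is an assumed name for the main pool, and both only resolve from inside the cluster:

```python
# Probe the API pool and the main app-server pool separately.
# appservers.svc.eqiad.wmnet is assumed here and may not be the exact
# name of the main pool's LVS service.
import urllib.request

for svc in ("api.svc.eqiad.wmnet", "appservers.svc.eqiad.wmnet"):
    try:
        resp = urllib.request.urlopen("http://%s/" % svc, timeout=10)
        print(svc, "->", resp.status)
    except Exception as exc:  # timeouts here match the icinga criticals
        print(svc, "-> FAILED:", exc)
```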
[13:49:23] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:24] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:24] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:24] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:24] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:33] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:34] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.738 second response time [13:49:34] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 3.606 second response time [13:49:44] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 6.778 second response time [13:49:55] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.501 second response time [13:50:03] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.583 second response time [13:50:04] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:04] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:05] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:05] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:14] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:14] ooh [13:50:14] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:14] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:14] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:14] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.829 second response time [13:50:34] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:35] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:43] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 5.701 second response time [13:50:43] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.826 second response time [13:50:44] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:44] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.930 second response time [13:50:44] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.084 second response time [13:50:44] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.944 second response time [13:50:45] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:53] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.753 second response 
time [13:50:54] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:54] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.697 second response time [13:50:54] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.203 second response time [13:50:55] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:55] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.668 second response time [13:50:55] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.100 second response time [13:51:03] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.471 second response time [13:51:04] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:04] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.714 second response time [13:51:05] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.276 second response time [13:51:13] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:14] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:14] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.106 second response time [13:51:23] PROBLEM - HHVM rendering on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:24] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:33] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:34] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.514 second response time [13:51:34] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:35] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:35] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:45] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.235 second response time [13:51:45] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.266 second response time [13:51:46] (03CR) 10Jhobs: [C: 031] "+1, but a recommendation." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [13:51:53] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 4.623 second response time [13:51:54] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:14] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [13:52:15] PROBLEM - Apache HTTP on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:23] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:24] PROBLEM - HHVM rendering on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:33] PROBLEM - Host kafka1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:52:33] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:34] PROBLEM - Apache HTTP on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:34] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:35] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:35] PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:43] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:43] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:43] PROBLEM - HHVM rendering on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:43] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:44] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:44] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:44] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:45] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:54] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 2.324 second response time [13:52:55] RECOVERY - HHVM rendering on mw1124 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.627 second response time [13:52:55] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.482 second response time [13:52:55] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.503 second response time [13:52:55] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.230 second response time [13:52:55] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.023 second response time [13:52:55] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.835 second response time [13:52:56] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.406 second response time [13:52:56] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.982 second response time [13:53:03] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.479 second 
response time [13:53:03] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 6.088 second response time [13:53:03] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 9.164 second response time [13:53:04] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 7.358 second response time [13:53:13] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.648 second response time [13:53:13] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 7.152 second response time [13:53:13] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.164 second response time [13:53:21] (03PS1) 10Faidon Liambotis: Remove kafka1012 from wmgKafkaServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267022 [13:53:23] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.441 second response time [13:53:23] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:24] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.518 second response time [13:53:24] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:25] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 2.953 second response time [13:53:33] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 4.523 second response time [13:53:33] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:44] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.425 second response time [13:53:51] (03CR) 10Faidon Liambotis: [C: 032] Remove kafka1012 from wmgKafkaServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267022 (owner: 10Faidon Liambotis) [13:53:53] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.968 second response time [13:53:54] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.860 second response time [13:53:54] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.964 second response time [13:53:54] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.113 second response time [13:54:04] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.020 second response time [13:54:05] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [13:54:13] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 3.907 second response time [13:54:14] (03PS2) 10Faidon Liambotis: Remove kafka1012 from wmgKafkaServers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267022 [13:54:16] (03CR) 10Bmansurov: Enable QuickSurveys in eswiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [13:54:18] (03CR) 10Phuedx: "Deploying to beta is less risky than deploying beta to production at the same time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [13:54:20] (03CR) 10Jhobs: [C: 031] "Same recommendations as eswiki patch." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266955 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [13:54:23] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.413 second response time [13:54:23] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:24] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.361 second response time [13:54:25] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:33] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:33] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:34] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 4.935 second response time [13:54:44] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:44] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:44] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:45] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:45] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:45] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.222 second response time [13:54:49] (03CR) 10Phuedx: [C: 04-1] "I61431c2 has been merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [13:54:53] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 3.795 second response time [13:54:53] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:53] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:04] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.925 second response time [13:55:05] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.858 second response time [13:55:13] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:15] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 4.402 second response time [13:55:23] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.076 second response time [13:55:24] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:24] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 3.543 second response time [13:55:33] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.768 second response time [13:55:33] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.598 second response time [13:55:33] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 8.107 second response time [13:55:33] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [13:55:33] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:33] PROBLEM - Apache HTTP on mw1233 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [13:55:34] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.535 second response time [13:55:34] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:34] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.650 second response time [13:55:35] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:35] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:36] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:44] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.786 second response time [13:55:53] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.159 second response time [13:55:54] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.887 second response time [13:56:04] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [13:56:13] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 4.152 second response time [13:56:14] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.448 second response time [13:56:23] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:23] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 9.278 second response time [13:56:24] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:24] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.442 second response time [13:56:25] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.458 second response time [13:56:34] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.707 second response time [13:56:35] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.799 second response time [13:56:35] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 4.882 second response time [13:56:43] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:43] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:44] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:44] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:44] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:44] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:53] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:54] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:54] PROBLEM - HHVM rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:54] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:55] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes 
in 2.567 second response time [13:57:15] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 16564 bytes in 1.068 second response time [13:57:16] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:17] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:17] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:17] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:27] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.543 second response time [13:57:28] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.845 second response time [13:57:36] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.388 second response time [13:57:36] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 7.472 second response time [13:57:36] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:36] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:47] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:48] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.488 second response time [13:57:56] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:56] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:06] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:07] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.378 second response time [13:58:08] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:08] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:12] !log faidon@mira Synchronized wmf-config/InitialiseSettings.php: depool kafka1012 (duration: 02m 10s) [13:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:16] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.929 second response time [13:58:16] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.130 second response time [13:58:16] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:17] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 1.064 second response time [13:58:18] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 5.619 second response time [13:58:26] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [13:58:27] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.328 second response time [13:58:36] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [13:58:37] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [13:58:37] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.134 second response time [13:58:37] 
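For readers following along: the depool faidon just synced is a plain config-list change in mediawiki-config. A minimal sketch of the shape of such a change, assuming wmgKafkaServers is a flat list of broker addresses keyed like other InitialiseSettings.php stanzas (the real structure and port are assumptions here):

```php
<?php
// Hypothetical fragment in the style of wmf-config/InitialiseSettings.php;
// in the real file this lives inside the big settings array.
// Commenting a broker out of the client list keeps MediaWiki from trying
// to bootstrap against a host that is down or about to be rebooted.
$settings['wmgKafkaServers'] = [
	'default' => [
		// 'kafka1012.eqiad.wmnet:9092', // depooled: host is down (see above)
		'kafka1013.eqiad.wmnet:9092',
		'kafka1014.eqiad.wmnet:9092',
		'kafka1018.eqiad.wmnet:9092',
		'kafka1020.eqiad.wmnet:9092',
		'kafka1022.eqiad.wmnet:9092',
	],
];
```

Once the change is synced, app servers stop contacting the dead broker, which lines up with the wave of HHVM rendering and Apache recoveries that follows.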
RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.273 second response time [13:58:37] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 65471 bytes in 0.446 second response time [13:58:37] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 1.778 second response time [13:58:46] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [13:58:46] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [13:58:46] PROBLEM - HHVM rendering on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:46] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:47] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:56] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 6.765 second response time [13:58:56] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [13:58:56] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.123 second response time [13:58:56] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.144 second response time [13:58:57] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [13:58:57] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.134 second response time [13:58:58] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.120 second response time [13:59:06] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 9.946 second response time [13:59:06] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 8.503 second response time [13:59:07] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.195 second response time [13:59:17] PROBLEM - Apache HTTP on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:17] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:17] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [13:59:18] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [13:59:18] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.126 second response time [13:59:18] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.136 second response time [13:59:18] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [13:59:19] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [13:59:19] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.214 second response time [13:59:20] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.034 second response time [13:59:20] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time 
[13:59:21] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [13:59:26] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.133 second response time [13:59:26] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [13:59:26] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 4.523 second response time [13:59:26] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 4.397 second response time [13:59:26] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:26] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [13:59:26] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [13:59:38] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.023 second response time [13:59:46] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.143 second response time [13:59:46] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:47] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [13:59:47] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.026 second response time [13:59:47] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.121 second response time [13:59:47] RECOVERY - restbase endpoints health on restbase1006 is OK: All endpoints are healthy [13:59:47] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy [13:59:47] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [13:59:48] RECOVERY - restbase endpoints health on restbase1002 is OK: All endpoints are healthy [13:59:48] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:59:49] RECOVERY - restbase endpoints health on restbase1005 is OK: All endpoints are healthy [13:59:49] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [13:59:56] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.120 second response time [13:59:56] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.103 second response time [13:59:57] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.117 second response time [13:59:57] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.125 second response time [13:59:57] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.090 second response time [13:59:58] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.098 second response time [13:59:58] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.123 second response time [13:59:58] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.101 second response time [14:00:06] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [14:00:07] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 440 bytes in 0.025 second response time [14:00:07] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [14:00:07] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.025 second response time [14:00:07] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.102 second response time [14:00:07] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.091 second response time [14:00:07] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [14:00:08] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [14:00:08] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [14:00:09] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [14:00:09] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:00:10] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:00:10] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [14:00:11] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [14:00:26] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [14:00:26] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.102 second response time [14:00:26] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [14:00:26] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [14:00:27] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy [14:00:27] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [14:00:27] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [14:00:27] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [14:00:28] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [14:00:28] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [14:00:29] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [14:00:39] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [14:00:39] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [14:00:39] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.096 second response time [14:00:39] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.110 second response time [14:00:39] RECOVERY - HHVM rendering on mw1124 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.126 second response time [14:00:39] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 65463 bytes in 0.143 second response time [14:00:39] RECOVERY - restbase endpoints health 
on cerium is OK: All endpoints are healthy [14:00:40] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [14:00:40] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 65471 bytes in 0.132 second response time [14:00:41] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:00:46] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [14:00:46] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [14:01:07] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [14:01:07] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [14:01:07] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.126 second response time [14:01:07] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.107 second response time [14:01:07] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.101 second response time [14:01:16] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [14:01:18] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [14:01:26] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [14:02:54] (03CR) 10Bmansurov: Add sampling rates for mobile web language switcher (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [14:03:27] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.453 second response time [14:06:02] (03PS3) 10Bmansurov: Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) [14:07:50] (03PS1) 10Bmansurov: Add sampling rates for mobile web language switcher in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) [14:09:07] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:09:32] !log rebooting serpens/seaborgium for kernel update [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:56] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:10:40] (03PS2) 10Bmansurov: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266955 (https://phabricator.wikimedia.org/T123771) [14:11:44] (03PS3) 10Bmansurov: Enable QuickSurveys on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) [14:11:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:11:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:12:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:12:46] RECOVERY - Ulsfo 
HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:13:07] (03PS3) 10BBlack: Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) [14:13:09] (03PS3) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) [14:13:11] (03PS3) 10BBlack: cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) [14:14:10] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted - https://phabricator.wikimedia.org/T125080#1973919 (10matmarex) There was a little API outage a few minutes ago, now fixed. [14:14:55] (03CR) 10BBlack: [C: 032] Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack) [14:17:32] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted - https://phabricator.wikimedia.org/T125080#1973937 (10hashar) Parsoid relies on the MediaWiki API and the latter was totally unresponsive. Incident occurred roughly between 13:30 UTC and 14:00 UTC.... [14:18:15] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Actually done better in https://gerrit.wikimedia.org/r/#/c/265541/" [puppet] - 10https://gerrit.wikimedia.org/r/266996 (https://phabricator.wikimedia.org/T125029) (owner: 10Mobrovac) [14:19:11] (03PS3) 10Alexandros Kosiaris: Cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 [14:19:21] (03PS3) 10BBlack: restbase legacy hostnames -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266749 (https://phabricator.wikimedia.org/T110475) [14:19:54] (03PS4) 10Alexandros Kosiaris: Cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 (https://phabricator.wikimedia.org/T125029) [14:20:28] (03CR) 10Alexandros Kosiaris: [C: 032] "It's been a week, I've cleaned up pybal configuration yesterday, merging" [puppet] - 10https://gerrit.wikimedia.org/r/265541 (https://phabricator.wikimedia.org/T125029) (owner: 10Alexandros Kosiaris) [14:22:34] (03PS5) 10Alexandros Kosiaris: Cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 (https://phabricator.wikimedia.org/T125029) [14:22:36] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:18] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 65462 bytes in 0.706 second response time [14:24:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 69.57% of data above the critical threshold [5000000.0] [14:24:32] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted (API outage) - https://phabricator.wikimedia.org/T125080#1973967 (10matmarex) [14:26:43] 6operations, 6Discovery, 10MediaWiki-Logging: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1973969 (10faidon) 3NEW [14:28:38] PROBLEM - Host seaborgium is DOWN: PING CRITICAL - Packet loss = 100% [14:29:26] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted (API outage) - https://phabricator.wikimedia.org/T125080#1973978 (10Trizek-WMF) p:5High>3Unbreak!
Still happening, from time to time. I managed to get this error message while posting a reply: [d56... [14:32:03] (03CR) 10Alexandros Kosiaris: [C: 032] Cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 (https://phabricator.wikimedia.org/T125029) (owner: 10Alexandros Kosiaris) [14:32:58] 6operations, 6Services: Split the API MediaWiki appserver pool into two external/internal pools - https://phabricator.wikimedia.org/T125085#1973988 (10faidon) 3NEW [14:33:03] mark: ^^^ [14:33:37] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [14:38:57] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:39:47] RECOVERY - Host seaborgium is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [14:47:38] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [14:53:06] 6operations, 6Discovery, 10MediaWiki-Logging: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974045 (10Joe) I just found out that there seems to be a bug in HHVM's fsockopen implementation, so that when you try to connect to a dead host, it will not re... [14:53:54] (03CR) 10Alexandros Kosiaris: [C: 031] "Fully support this, but I suppose we need a meeting for this ?" [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [15:02:34] 6operations, 6Discovery, 10MediaWiki-Logging: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974068 (10Joe) My (very early) testing seems to show that HHVM 3.11 behaves correctly in this case. [15:04:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 62.96% of data above the critical threshold [5000000.0] [15:08:29] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted (API outage) - https://phabricator.wikimedia.org/T125080#1974082 (10mark) >>! In T125080#1973978, @Trizek-WMF wrote: > Still happening, from time to time. > > I managed to get this error message while... [15:11:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:14:56] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted (API outage) - https://phabricator.wikimedia.org/T125080#1974098 (10Trizek-WMF) p:5Unbreak!>3High Nothing new. [15:20:02] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974104 (10Joe) [15:21:09] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1973969 (10Joe) So the workaround while we deploy a new HHVM package (which will take some time) is to depool kafka servers whenever we need to reboot...
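Joe's observation above is about the connect path in PHP's socket API. A minimal reproduction sketch, using a hypothetical dead host (this is not the actual monolog/Kafka handler code): on a runtime that honors the timeout argument this fails fast, while the bug being described makes the call hang much longer, stalling any request that tried to log to Kafka.

```php
<?php
// Sketch only: the host below is hypothetical.
$errno = 0;
$errstr = '';
$start = microtime( true );
// The fifth argument is the connect timeout in seconds; the bug under
// discussion is HHVM not honoring it when the peer host is dead.
$fp = @fsockopen( 'dead-broker.example.org', 9092, $errno, $errstr, 1.0 );
$elapsed = microtime( true ) - $start;
if ( $fp === false ) {
	printf( "connect failed after %.2fs (%d: %s)\n", $elapsed, $errno, $errstr );
} else {
	printf( "connected after %.2fs\n", $elapsed );
	fclose( $fp );
}
```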
[15:23:31] !log powering up kafka1012 [15:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:53] RECOVERY - Host kafka1012 is UP: PING OK - Packet loss = 0%, RTA = 3.32 ms [15:31:49] 6operations, 10RESTBase-Cassandra: replace default Cassandra superuser - https://phabricator.wikimedia.org/T113622#1974127 (10Eevans) a:3Eevans [15:32:17] 6operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1974130 (10Eevans) [15:32:45] PROBLEM - Check size of conntrack table on kafka1012 is CRITICAL: Connection refused by host [15:32:45] PROBLEM - dhclient process on kafka1012 is CRITICAL: Connection refused by host [15:32:54] PROBLEM - RAID on kafka1012 is CRITICAL: Connection refused by host [15:32:54] PROBLEM - salt-minion processes on kafka1012 is CRITICAL: Connection refused by host [15:32:54] PROBLEM - configured eth on kafka1012 is CRITICAL: Connection refused by host [15:33:15] PROBLEM - IPsec on kafka1012 is CRITICAL: Connection refused by host [15:33:23] PROBLEM - DPKG on kafka1012 is CRITICAL: Connection refused by host [15:33:24] PROBLEM - Disk space on kafka1012 is CRITICAL: Connection refused by host [15:33:34] PROBLEM - puppet last run on kafka1012 is CRITICAL: Connection refused by host [15:33:41] PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: Connection refused by host [15:33:55] PROBLEM - SSH on kafka1012 is CRITICAL: Connection refused [15:34:03] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobil [15:34:24] PROBLEM - jmxtrans on kafka1012 is CRITICAL: Connection refused by host [15:34:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobil [15:34:46] 6operations, 10ops-codfw: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1974148 (10Papaul) 3NEW a:3Papaul [15:36:43] !log kafka1012: manually edited fstab, s/sdb1/sdb3/, s/sdc3/sdc1/, and now the filesystems mount and data looks right [15:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:03] !log rebooting kafka1012 [15:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:11] 6operations, 10ops-codfw: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1974168 (10Papaul) [15:39:13] !log mobileapps deployed 869ec35 [15:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:34] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 28 ESP OK [15:39:34] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 28 ESP OK [15:39:34] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 38 ESP OK [15:39:34] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 28 ESP OK [15:39:34] RECOVERY - IPsec on 
cp2012 is OK: Strongswan OK - 20 ESP OK [15:39:34] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 38 ESP OK [15:39:35] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [15:39:35] !log kafka1012 booted up normally [15:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:43] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 38 ESP OK [15:39:44] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK [15:39:44] RECOVERY - jmxtrans on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar [15:39:44] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 38 ESP OK [15:39:44] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 20 ESP OK [15:39:45] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 20 ESP OK [15:39:45] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [15:39:45] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 28 ESP OK [15:39:45] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK [15:39:53] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 28 ESP OK [15:39:54] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 20 ESP OK [15:39:55] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK [15:39:55] RECOVERY - Check size of conntrack table on kafka1012 is OK: OK: nf_conntrack is 0 % full [15:39:55] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK [15:39:55] RECOVERY - dhclient process on kafka1012 is OK: PROCS OK: 0 processes with command name dhclient [15:40:03] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 38 ESP OK [15:40:03] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 38 ESP OK [15:40:03] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 38 ESP OK [15:40:04] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 38 ESP OK [15:40:04] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 20 ESP OK [15:40:04] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 20 ESP OK [15:40:04] RECOVERY - IPsec on cp3019 is OK: Strongswan OK - 20 ESP OK [15:40:04] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 20 ESP OK [15:40:05] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 28 ESP OK [15:40:05] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK [15:40:06] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 38 ESP OK [15:40:06] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 38 ESP OK [15:40:07] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 28 ESP OK [15:40:07] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 20 ESP OK [15:40:24] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 28 ESP OK [15:40:24] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 38 ESP OK [15:40:24] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 28 ESP OK [15:40:24] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 38 ESP OK [15:40:24] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 38 ESP OK [15:40:24] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 38 ESP OK [15:40:24] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 28 ESP OK [15:40:25] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [15:40:25] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [15:40:26] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 28 ESP OK [15:40:26] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [15:40:27] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 38 ESP OK [15:40:27] RECOVERY - DPKG on kafka1012 is OK: All packages OK [15:40:28] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 166 ESP OK [15:40:33] RECOVERY - IPsec on cp3046 is OK: Strongswan OK 
- 38 ESP OK [15:40:34] RECOVERY - Disk space on kafka1012 is OK: DISK OK [15:40:44] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 20 ESP OK [15:40:44] RECOVERY - IPsec on cp3012 is OK: Strongswan OK - 28 ESP OK [15:40:44] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [15:40:45] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 28 ESP OK [15:40:45] RECOVERY - IPsec on cp3022 is OK: Strongswan OK - 20 ESP OK [15:40:53] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 38 ESP OK [15:40:54] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 28 ESP OK [15:40:54] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 20 ESP OK [15:40:54] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 20 ESP OK [15:40:54] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 20 ESP OK [15:40:54] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 20 ESP OK [15:40:54] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 28 ESP OK [15:40:54] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 20 ESP OK [15:40:55] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 28 ESP OK [15:40:55] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 20 ESP OK [15:40:56] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 38 ESP OK [15:40:56] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK [15:40:57] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 28 ESP OK [15:40:57] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 28 ESP OK [15:41:14] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 20 ESP OK [15:41:14] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 38 ESP OK [15:41:14] RECOVERY - IPsec on cp3020 is OK: Strongswan OK - 20 ESP OK [15:42:32] RECOVERY - Kafka Broker Server on kafka1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [15:42:33] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:44] (03CR) 10BBlack: [C: 032] restbase legacy hostnames -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266749 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack) [15:46:59] (03PS1) 10Muehlenhoff: Upgrade to 1.0.2f [debs/openssl] - 10https://gerrit.wikimedia.org/r/267033 [15:47:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Upgrade to 1.0.2f [debs/openssl] - 10https://gerrit.wikimedia.org/r/267033 (owner: 10Muehlenhoff) [15:53:31] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1974188 (10Dzahn) @Elukey does it let you execute commands though, in addition to letting you login? [15:55:18] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1974189 (10elukey) @Dzahn: yep! [15:55:30] 6operations, 10Beta-Cluster-Infrastructure, 6Labs: Duplicate IP address DNS entry - https://phabricator.wikimedia.org/T125040#1974191 (10hashar) integration-t102108-jessie-new2 is an old one that got deleted but apparently is still registered in LDAP :( [15:58:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [15:58:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [15:59:16] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1974196 (10Dzahn) @elukey ok, great, then it works, and as Moritz already said, it's actually capitalized in your case. 
I think if @ema can also execute commands we can call this resolved. (un... [15:59:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160128T1600). [16:00:04] Addshore urandom Krenair bmansurov: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:10] *waves* [16:00:21] xD [16:00:26] here [16:00:41] urandom? [16:01:09] addshore, so this is working well on all currently enabled sites, right? [16:01:15] yup [16:01:27] (03PS2) 10Alex Monk: wgRCWatchCategoryMembership true on wikipedias & commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264733 (owner: 10Addshore) [16:01:28] no issues have been reported from dewiki in the past week [16:01:34] (03CR) 10Alex Monk: [C: 032] wgRCWatchCategoryMembership true on wikipedias & commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264733 (owner: 10Addshore) [16:01:37] and it has been on mediawiki and test for many weeks now [16:01:58] Krenair: present! [16:02:04] (03Merged) 10jenkins-bot: wgRCWatchCategoryMembership true on wikipedias & commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264733 (owner: 10Addshore) [16:03:06] addshore, syncing [16:03:10] awesome [16:04:58] (03CR) 10Phuedx: Add sampling rates for mobile web language switcher on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [16:04:59] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264733/ (duration: 02m 11s) [16:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:04] addshore, ^ [16:05:08] checking [16:05:34] looks good on enwiki [16:07:56] urandom, non-closed security blocker? [16:08:18] "I don't consider the system meeting our security standards until T120409: RESTBase should honor wiki-wide deletion/suppression of users is fixed. @GWicke says that will be fixed in January." [16:08:52] that's not a regression, and he later went on to say he was ok with deployment if the greater issue was dealt with this quarter [16:08:55] let me find the quote [16:08:58] 6operations, 10Analytics-Cluster: analytics1017.eqiad.wmnet issues (no ssh in, no salt response) - https://phabricator.wikimedia.org/T125055#1974228 (10Ottomata) Yeah, I had tried to reinstall it back during the all staff, but had some issue and never finished. [16:09:15] yes, I see that [16:09:29] (03CR) 10Alex Monk: [C: 032] Enable EventBus on remaining (applicable) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266564 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [16:10:27] (03Merged) 10jenkins-bot: Enable EventBus on remaining (applicable) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266564 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [16:13:29] urandom, syncing [16:13:29] Krenair: thanks; i'm monitoring!
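The patch Krenair just synced for addshore follows the usual staged-rollout pattern in wmf-config: a per-wiki settings map with a conservative default, widened one group of wikis at a time. A sketch of that shape (keys and comments are illustrative; the actual stanza may differ):

```php
<?php
// Hypothetical fragment in the style of wmf-config/InitialiseSettings.php;
// in the real file this lives inside the big settings array.
$settings['wgRCWatchCategoryMembership'] = [
	'default' => false,
	'mediawikiwiki' => true, // enabled for many weeks
	'testwiki' => true,      // enabled for many weeks
	'dewiki' => true,        // enabled a week earlier, no issues reported
	'wikipedia' => true,     // this SWAT: all Wikipedias via the dblist tag
	'commonswiki' => true,   // ...and Commons
];
```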
[16:13:29] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266564/ (duration: 02m 12s) [16:13:29] urandom, ^ [16:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:29] urandom, Krenair: we already avoided that issue by disabling caching for end points that expose any user information [16:14:28] (03CR) 10Phuedx: [C: 04-1] "-1 for your attention." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [16:14:37] Krenair: it's looking good so far [16:15:07] (03CR) 10Bmansurov: Add sampling rates for mobile web language switcher on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [16:16:30] bmansurov, hey [16:16:36] hello [16:16:41] (03PS3) 10Alex Monk: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266955 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:16:54] (03CR) 10Alex Monk: [C: 032] Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266955 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:17:18] (03Merged) 10jenkins-bot: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266955 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:17:24] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [16:17:25] (03Draft1) 10Addshore: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 [16:18:08] bmansurov, syncing [16:18:15] cool [16:19:34] 7Blocked-on-Operations, 6operations, 10MediaWiki-API, 10Traffic, 7Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1974238 (10GWicke) [16:19:40] (03CR) 10Luke081515: [C: 031] wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [16:19:58] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266955/ (duration: 02m 14s) [16:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:07] bmansurov, ^ [16:20:17] 7Blocked-on-Operations, 6operations, 10MediaWiki-API, 6Services, and 2 others: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1939492 (10GWicke) [16:20:52] Krenair: does that mean: all set? [16:21:38] bmansurov, it should be live now [16:21:51] ok [16:22:06] your second patch needs manual rebasing [16:22:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [16:22:46] Krenair: ok, but did my patch just break https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C? [16:23:21] ugh [16:23:46] revert is syncing [16:24:14] bd808, can we have some way to force sync-master to happen afterwards? [16:24:58] Krenair: is something wrong with the patch? 
[16:25:05] yes [16:25:10] !log upgrading packages (incl kernel) on all codfw caches [16:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:30] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: rv (duration: 02m 10s) [16:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:33] (03CR) 10Phuedx: Add sampling rates for mobile web language switcher on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [16:26:15] bmansurov, 2016-01-28 16:25:21 mw1054 fawiki exception ERROR: [fc284632] /wiki/[removed] InvalidArgumentException from line 62 of /srv/mediawiki/php-1.27.0-wmf.10/extensions/QuickSurveys/includes/SurveyFactory.php: The "Reader-segmentation-1" survey doesn't have any platforms. {"exception_id":"fc284632"} [16:26:15] [Exception InvalidArgumentException] (/srv/mediawiki/php-1.27.0-wmf.10/extensions/QuickSurveys/includes/SurveyFactory.php:62) The "Reader-segmentation-1" survey doesn't have any platforms. [16:26:45] Krenair: ok thanks, i'll push a fix [16:27:11] I need to push the revert first [16:27:23] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:27:44] (03PS1) 10Tim Landscheidt: shinken: Add role::labs::instance as hostgroup to all instances [puppet] - 10https://gerrit.wikimedia.org/r/267039 (https://phabricator.wikimedia.org/T123271) [16:27:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:28:24] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [16:28:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [16:29:28] (03PS1) 10Bmansurov: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) [16:29:43] (03PS3) 10BBlack: role::cache: install conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/263620 (owner: 10Giuseppe Lavagetto) [16:29:45] (03CR) 10jenkins-bot: [V: 04-1] Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:29:49] !log mobileapps deployed 7583148, reverting in part 869ec35 [16:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:19] (03PS1) 10Alex Monk: Revert "Enable QuickSurveys on fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267041 [16:30:52] (03CR) 10Alex Monk: [C: 032] "already in prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267041 (owner: 10Alex Monk) [16:32:01] we have a small spike in 503s, affecting esams more than others [16:32:01] (03Merged) 10jenkins-bot: Revert "Enable QuickSurveys on fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267041 (owner: 10Alex Monk) [16:32:20] it will take a few more minutes to see the shape of it accurately, though [16:32:45] probably the InvalidArgumentException pasted above [16:33:26] 6operations, 10MediaWiki-API, 6Services, 10Traffic, 7Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1974270 (10faidon) [16:33:52] bblack, was it limited to fawiki? [16:34:00] or is it not easy to tell? 
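The InvalidArgumentException above is SurveyFactory rejecting a survey definition that has no platforms stanza. A sketch of the kind of entry the fixed patch needs, with field names per the QuickSurveys extension but purely illustrative values (this is not the actual fawiki definition):

```php
<?php
// Hypothetical wmf-config fragment; values are placeholders.
$settings['wgQuickSurveysConfig'] = [
	'fawiki' => [
		[
			'name' => 'Reader-segmentation-1',
			'type' => 'internal',
			'question' => 'reader-segmentation-1-question',
			'enabled' => true,
			'coverage' => 0.01,
			// The stanza the broken patch omitted: without it,
			// SurveyFactory.php throws on every page view.
			'platforms' => [
				'desktop' => [ 'stable' ],
				'mobile' => [ 'stable', 'beta' ],
			],
		],
	],
];
```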
[16:34:05] (03PS2) 10Bmansurov: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) [16:34:14] Krenair: not easy to tell from the graph, anyways [16:34:19] ok [16:34:23] I can dig in oxygen a bit [16:34:52] IR (most of fawiki probably) does map to esams though, so it makes sense [16:34:58] the rate is much lower on other datacenters [16:35:18] (03PS3) 10Bmansurov: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) [16:35:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:35:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:35:41] Krenair: this is the fixed patch: https://gerrit.wikimedia.org/r/#/c/267040/ [16:36:11] yeah oxygen confirms fawiki [16:36:52] (03PS3) 10Alex Monk: Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [16:36:57] (03CR) 10Alex Monk: [C: 032] Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [16:37:16] (03PS4) 10Bmansurov: Enable QuickSurveys on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) [16:37:26] (03Merged) 10jenkins-bot: Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [16:37:53] !log Downloaded and `chmod +x`'d mira:/srv/mediawiki-staging/.git/hooks/commit-msg [16:37:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [16:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:56] there's a bunch of things in that directory which aren't useful [16:39:08] lrwxrwxrwx 1 reedy wikidev 25 Sep 11 2014 post-commit -> /a/common/logmsg-git-hook [16:39:08] lrwxrwxrwx 1 reedy wikidev 25 Sep 11 2014 post-merge -> /a/common/logmsg-git-hook [16:39:08] lrwxrwxrwx 1 reedy wikidev 25 Sep 11 2014 post-rewrite -> /a/common/logmsg-git-hook [16:39:15] but: ls: cannot access /a/common: No such file or directory [16:39:26] that doesn't sound good :) [16:39:53] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [5000000.0] [16:41:14] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [16:41:18] sigh [16:41:21] why is sync-masters so slow [16:41:56] I asked the same when i was syncing earlier [16:42:04] I was hoping someone from releng would know :) [16:42:31] it was like that during the deploy on tuesday, I think it didn't get to the top of the debug list [16:42:43] Krenair: I need to read the code but part of my intent with sync-master was to make the rsync fanout servers pull from the closest master so that would make putting it later in the process very bad [16:42:49] !log krenair@mira Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/264219/ (duration: 02m 12s) [16:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:54] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Puppet has 1 failures [16:43:50] ^ certainly me from apt-get upgrade,
ignorable [16:44:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 78.26% of data above the critical threshold [5000000.0] [16:45:45] If sync-master is slow then there is some bottleneck in tin pulling rsync changes from mira. A root could investigate by running the rsync command inside /usr/local/bin/scap-master-sync with a --verbose flag as a first step. [16:45:45] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264219/ (duration: 02m 12s) [16:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:25] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974286 (10Ottomata) Hmm, I think we could put kafka clusters into pybal, and use LVS for bootstrapping/metadata lookups. Then clients could use somet... [16:47:18] (03PS4) 10Alex Monk: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:47:22] (03CR) 10Alex Monk: [C: 032] Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:47:44] RECOVERY - Host mw1228 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [16:47:53] (03Merged) 10jenkins-bot: Enable QuickSurveys on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267040 (https://phabricator.wikimedia.org/T123771) (owner: 10Bmansurov) [16:48:20] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974293 (10faidon) >>! In T125084#1974286, @Ottomata wrote: > Hmm, I think we could put kafka clusters into pybal, and use LVS for bootstrapping/metada... [16:48:22] !log mw1172, mw1178,mw1217, mw1257 powering off task# T124642 [16:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:32] bmansurov, okay, this is looking better on mw1017 [16:49:40] (03CR) 10Jhobs: [C: 031] Enable QuickSurveys on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [16:49:43] ok [16:49:51] syncing out to the rest of the cluster [16:50:08] cool [16:50:13] (03PS5) 10Bmansurov: Enable QuickSurveys on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) [16:51:54] RECOVERY - Host mw1257 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [16:51:59] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267040/ (duration: 02m 13s) [16:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:04] bmansurov, ^ [16:52:15] ok [16:53:24] RECOVERY - Host mw1178 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [16:54:04] RECOVERY - Host mw1172 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [16:55:03] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [16:55:24] RECOVERY - Host mw1217 is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [16:56:08] bmansurov, all good? 
[16:56:34] Krenair: yes, the survey just appeared: https://fa.wikipedia.org/wiki/%D8%B4%DB%8C%D9%86%D8%AF%D9%86%D8%AF?quicksurvey=true [16:56:36] thank you [16:56:46] (03CR) 10Alex Monk: [C: 032] Enable QuickSurveys on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [16:57:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [16:57:40] (03Merged) 10jenkins-bot: Enable QuickSurveys on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266957 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [16:58:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [16:59:10] bmansurov, looks fine on mw1017, syncing [16:59:16] ok [17:00:04] RobH cmjohnson1: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160128T1700). [17:00:04] ebernhardson urandom: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:11] Krenair: thanks for your help! [17:00:28] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974366 (10jcrespo) Reminder to delete workaround documentation [[ https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration#Safe_Brok... [17:00:36] oh yes [17:01:16] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266957/ (duration: 02m 15s) [17:01:20] bmansurov, ^ [17:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:01:26] ok [17:02:07] Krenair: works. Thanks! [17:02:37] (03PS2) 10RobH: [staging]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266299 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [17:02:57] urandom: heyas you about? [17:03:04] robh: i am! [17:03:10] I'm going to puppetswat your restbase change if its a good time? [17:03:22] robh: that would be great [17:03:40] is this production? [17:03:43] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974371 (10hashar) [17:03:44] urandom: it seems to just be modifying hieradata so nothing will fire off of this on the hosts other than a backend config change? [17:03:46] gwicke: no [17:03:49] okay [17:03:55] its in [staging] [17:04:01] so i assumed that meant no prod change? [17:04:04] robh: correct [17:04:09] robh: correct [17:04:14] (to both) [17:04:17] excellent [17:04:22] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1973969 (10hashar) Edited task detail to point to {T88732} which is the log/redis issue faidon was referring to. Got solved by using syslog/UDP for tr... [17:04:30] (03CR) 10RobH: [C: 032] "puppetswatting now with urandom" [puppet] - 10https://gerrit.wikimedia.org/r/266299 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [17:05:01] urandom: its now live on palladium [17:05:07] did you want me to salt fire puppet on any hosts? [17:05:24] rephrase: want me to force a puppet run on any hosts?
[17:05:27] robh: i added one more puppet swat patch, https://gerrit.wikimedia.org/r/#/c/266663/ which should be reasonable but i couldn't find anyone to review it ahead of time [17:05:35] robh: sure [17:05:57] (03PS5) 10Ottomata: Removing code that generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [17:06:12] (03CR) 10Ottomata: [C: 032 V: 032] Removing code that generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [17:06:14] PROBLEM - mediawiki-installation DSH group on mw1217 is CRITICAL: Host mw1217 is not in mediawiki-installation dsh group [17:06:16] robh: is it going out host-by-host? [17:06:29] urandom: i was asking which hosts you wanted it fired on? [17:06:32] all restbase? [17:07:04] right now it's live, so when any affected hosts do their normal puppet call-in it will go live to them [17:07:20] robh: that should be fine [17:07:26] ok, cool, then done =] [17:07:44] robh: thank you sir! [17:07:57] ebernhardson: so i can try to get to that, but i was going to poke at the older patchset first [17:08:02] sure [17:08:15] granted both are your patchsets [17:08:19] ;] [17:09:15] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:09:32] urandom: puppet runs in a random order across hosts, and only updates the configs [17:09:39] it won't restart the service [17:09:44] gwicke: i know [17:09:59] gwicke: i can force a puppet run [17:10:00] kk [17:10:23] for a rolling deploy, you need to disable puppet manually on all nodes first [17:10:59] or rely on nothing restarting with the changed config [17:11:23] (03PS1) 10RobH: Add alert for elasticsearch 50th percentile prefix search time [puppet] - 10https://gerrit.wikimedia.org/r/267052 (https://phabricator.wikimedia.org/T124542) [17:11:59] gwicke: i'm confused, won't the config change when puppet runs? [17:12:32] yes, it will [17:12:45] it just doesn't force the service refresh (right?) [17:13:02] yes, and changes the configs in a random / un-coordinated manner [17:13:22] Krenair: do you have terminal scrollback of a slow sync-master?
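gwicke's caveat above is the crux: puppet agents check in on their own schedule, apply config in an uncoordinated order, and do not restart the service. A coordinated rolling deploy therefore freezes puppet everywhere first, then walks the nodes one at a time. A minimal sketch, assuming ssh/sudo access; the host list, service name, and health-check port are modelled on the restbase staging hosts mentioned later in this log:

```python
#!/usr/bin/env python
"""Rolling config deploy for the pattern described above (sketch).

The restbase staging hosts and port 7231 are assumptions based on this
log; real runs would go through salt or similar orchestration.
"""
import subprocess

HOSTS = ['cerium.eqiad.wmnet', 'xenon.eqiad.wmnet', 'praseodymium.eqiad.wmnet']
SERVICE = 'restbase'

def ssh(host, cmd):
    subprocess.check_call(['ssh', host, cmd])

# 1. Freeze puppet fleet-wide so nodes don't pick up the new config
#    in a random, uncoordinated order.
for h in HOSTS:
    ssh(h, 'sudo puppet agent --disable "rolling deploy"')

# 2. One node at a time: apply the config, restart, verify, move on.
for h in HOSTS:
    # 'puppet agent -t' exits 2 when changes were applied, so mask it.
    ssh(h, 'sudo puppet agent --enable && sudo puppet agent -t; true')
    ssh(h, 'sudo service %s restart' % SERVICE)
    ssh(h, 'curl -sf -o /dev/null http://localhost:7231/')  # assumed health URL
```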
(for a bug report) [17:13:50] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1974418 (10elukey) Added also a note in https://wikitech.wikimedia.org/wiki/Service_restarts [17:13:52] no [17:14:14] PROBLEM - mediawiki-installation DSH group on mw1172 is CRITICAL: Host mw1172 is not in mediawiki-installation dsh group [17:15:07] (03PS1) 10EBernhardson: Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 [17:15:09] (03PS2) 10RobH: Add alert for elasticsearch 50th percentile prefix search time [puppet] - 10https://gerrit.wikimedia.org/r/267052 (https://phabricator.wikimedia.org/T124542) [17:15:49] greg-g, getting one [17:15:51] !log disabling puppet on restbase staging hosts [17:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:00] (03CR) 10jenkins-bot: [V: 04-1] Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 (owner: 10EBernhardson) [17:16:09] (03CR) 10RobH: [C: 032] Add alert for elasticsearch 50th percentile prefix search time [puppet] - 10https://gerrit.wikimedia.org/r/267052 (https://phabricator.wikimedia.org/T124542) (owner: 10RobH) [17:17:06] 6operations, 10ops-eqiad: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1974428 (10akosiaris) [17:17:08] 6operations, 10fundraising-tech-ops: make sure netapp fundraising share gets wiped - https://phabricator.wikimedia.org/T118535#1974426 (10akosiaris) 5Open>3Resolved A week has passed, volume has been wiped, resolving. [17:17:29] ok, rolling the change to icinga, let's see if it likes it [17:17:46] and if not i know to not force icinga to refresh and crash =P [17:17:51] (now) [17:17:51] !log krenair@mira Synchronized README: testing (duration: 02m 08s) [17:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:01] greg-g: [17:18:01] 17:15:42 Started sync-masters [17:18:02] sync-masters: 100% (ok: 1; fail: 0; left: 0) [17:18:02] 17:17:31 Finished sync-masters (duration: 01m 49s) [17:18:23] !log pushing icinga updates (shouldn't affect service but others shouldn't also try to update neon right now) [17:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:45] !log forcing puppet run on cerium.eqiad.wmnet (restbase staging) [17:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:26] !log restarting restbase on cerium.eqiad.wmnet [17:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:58] hrmm, i'm having issues verifying the icinga change, still working on it. [17:24:19] puppet says no errors, but i have errors regarding path when trying to checkconfig (it looks in the user homedir for info) [17:24:29] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:24:32] Krenair: ty [17:24:44] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1172, mw1178,mw1217, mw1257 are unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1974490 (10Cmjohnson) a:3Joe Fixed all the servers and verified mgmt interface is reachable.
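For reference against Krenair's 01m 49s sync-masters sample above, the earlier advice about scap-master-sync can be turned into a crude timing probe. A minimal sketch; the rsync module name, peer host, and staging path are assumptions, so mirror whatever /usr/local/bin/scap-master-sync actually invokes:

```python
#!/usr/bin/env python
"""Time a verbose rsync between the scap masters (diagnostic sketch).

The rsync module ('mediawiki-staging'), peer host, and local path are
assumptions; copy them from /usr/local/bin/scap-master-sync before use.
"""
import subprocess
import time

PEER = 'mira.codfw.wmnet'          # assumed peer master
DEST = '/srv/mediawiki-staging/'   # assumed staging tree

start = time.time()
# --verbose and --stats show per-file progress and transfer totals,
# separating "network is slow" from "scanning many files is slow".
subprocess.check_call([
    'rsync', '--archive', '--delete', '--verbose', '--stats',
    'rsync://%s/mediawiki-staging/' % PEER, DEST])
print('sync-masters equivalent took %.1fs' % (time.time() - start))
```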
assigning back @joe [17:25:49] !log forcing puppet run on xenon.eqiad.wmnet (restbase staging) [17:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:25:53] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1974502 (10Cmjohnson) Update, we did receive the wrong ssds for the HP restbase servers but the correct disks for the Dell. I added 2 new ssds to restbase1... [17:27:04] (03PS1) 10Alexandros Kosiaris: Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/267054 (https://phabricator.wikimedia.org/T124156) [17:27:17] !log restarting restbase on xenon.eqiad.wmnet (restbase staging) [17:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:10] !log forcing puppet run on praseodymium.eqiad.wmnet, and restarting restbase (staging env) [17:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:55] ok last puppet swat patch up [17:31:00] (03PS1) 10Alexandros Kosiaris: Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/267056 (https://phabricator.wikimedia.org/T124156) [17:32:09] PROBLEM - mediawiki-installation DSH group on mw1257 is CRITICAL: Host mw1257 is not in mediawiki-installation dsh group [17:33:09] (03PS3) 10RobH: Allow access to graphite/events/get_data [puppet] - 10https://gerrit.wikimedia.org/r/266663 (owner: 10EBernhardson) [17:33:20] ebernhardson: this apache change is much less scary than many others ;] [17:33:39] on a single host for a single appending location path [17:33:45] robh: :) [17:34:11] I always forget to puppetswat my patches :( [17:34:53] !log forcing puppet run on restbase200[1-3].codfw.wmnet (restbase staging) [17:34:55] ebernhardson: so when this is live [17:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:07] that url https://graphite.wikimedia.org/events/get_data?from=-1h&until=now&tags=esloadtest should show more than []? [17:35:19] robh: change from=-24h [17:35:39] robh: i was pushing events yesterday during a load test, and it's very useful to mark changes to the load test (concurrency, query patterns) directly into the graphs [17:36:31] (03CR) 10RobH: [C: 032] "puppetswatting this apache change (since it is a single host and appending to existing entry)" [puppet] - 10https://gerrit.wikimedia.org/r/266663 (owner: 10EBernhardson) [17:36:36] robh: end result, this link should no longer have errors: https://grafana.wikimedia.org/dashboard/db/elasticsearch-codfw-load-testing?from=1453909557409&to=1453918006118 [17:36:44] 6operations, 10Beta-Cluster-Infrastructure, 6Labs: Duplicate IP address DNS entry - https://phabricator.wikimedia.org/T125040#1974545 (10Andrew) 5Open>3Resolved a:3Andrew It's in designate, not ldap. But, yes, looks like we leaked one; I've cleaned it up.
[17:36:47] cool, it's going live now [17:36:50] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: puppet fail [17:36:51] PROBLEM - mediawiki-installation DSH group on mw1178 is CRITICAL: Host mw1178 is not in mediawiki-installation dsh group [17:37:42] * robh waits on puppetrun on graphite1001 [17:38:15] !log restarting restbase on restbase200[1-3].codfw.wmnet (restbase staging) [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:27] ebernhardson: success [17:38:34] no more error on refresh [17:38:47] robh: sweet! thanks. annotations come up now too [17:38:57] !log neglected to log i finished icinga/neon updates and it's back to normal service (never interrupted) [17:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:02] next step i'll have to figure out how to get deployment annotations pushed from scap into there so i can have other nice annotations :) [17:39:16] puppetswat over [17:39:29] ostriches: everyone forgets, this was the first week i had patches in the past 3 puppetswat rotations [17:39:44] !log finished deploying configuration change (https://gerrit.wikimedia.org/r/266299) to restbase staging [17:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:27] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1974576 (10Nuria) [17:41:59] robh: thank you! [17:42:28] very welcome [17:43:09] (03PS2) 10Alexandros Kosiaris: Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/267054 (https://phabricator.wikimedia.org/T124156) [17:43:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/267054 (https://phabricator.wikimedia.org/T124156) (owner: 10Alexandros Kosiaris) [17:43:51] !log rebooting analytics1002.eqiad.wmnet (Hadoop master's slave) for kernel upgrade [17:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:19] 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1974602 (10akosiaris) [17:46:13] 6operations, 10Mathoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#1974616 (10mobrovac) [17:52:05] (03PS2) 10Chad: Also keep /srv/patches in sync between masters [puppet] - 10https://gerrit.wikimedia.org/r/266773 [17:52:33] (03CR) 10Chad: Also keep /srv/patches in sync between masters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [17:52:59] 6operations, 7HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1974651 (10RobH) Email sent to the otrs admins: > > OTRS Admins, > > This planned maintenance is being tracked on https://phabricator.wikimedia.org/T122320 > > The SSL ce...
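ebernhardson's load-test annotations above go through Graphite's event store, and scap deploy annotations could be pushed the same way. A minimal sketch against the /events/ endpoint behind the get_data URL discussed above; whether 'tags' is a space-separated string or a list depends on the Graphite version, so treat that as an assumption to verify:

```python
#!/usr/bin/env python
"""Push a deploy annotation into Graphite's event store (sketch).

The endpoints match the get_data URL discussed above; the tag format
and payload fields should be checked against the running graphite-web
version.
"""
import json
import urllib2

event = {
    'what': 'scap sync: wmf-config/InitialiseSettings.php',
    'tags': 'deploy scap',  # older graphite-web expects a string here
    'data': 'example deploy annotation',
}
req = urllib2.Request('https://graphite.wikimedia.org/events/',
                      json.dumps(event),
                      {'Content-Type': 'application/json'})
urllib2.urlopen(req)

# Read annotations back the same way the grafana dashboard does:
print(urllib2.urlopen('https://graphite.wikimedia.org/events/get_data'
                      '?from=-1h&until=now&tags=deploy').read())
```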
[17:53:19] (03CR) 10jenkins-bot: [V: 04-1] Also keep /srv/patches in sync between masters [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [17:54:42] (03PS3) 10Chad: Also keep /srv/patches in sync between masters [puppet] - 10https://gerrit.wikimedia.org/r/266773 [17:58:11] 6operations, 6Commons, 6Multimedia: "401 Unauthorized" when trying to view any uploaded files in he.wikivoyage - https://phabricator.wikimedia.org/T48863#1974925 (10Luke081515) [17:59:03] (03Abandoned) 10Mobrovac: SCA: Remove Citoid, Mathoid and Graphoid [puppet] - 10https://gerrit.wikimedia.org/r/266996 (https://phabricator.wikimedia.org/T125029) (owner: 10Mobrovac) [18:00:05] yurik gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160128T1800). Please do the needful. [18:00:27] (03PS1) 10Yurik: Update graph settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267060 [18:00:34] 6operations, 6Labs, 10Tool-Labs: Relax restrictions on .htaccess - https://phabricator.wikimedia.org/T48003#1975308 (10Luke081515) [18:00:48] no parsoid deploy [18:01:02] graphoid needs deployment [18:01:02] did a new slot open up for these deploys? [18:01:07] subbu, ^ [18:01:18] i will deploy it shortly [18:01:35] ah, ok. [18:02:02] PROBLEM - Host hafnium is DOWN: PING CRITICAL - Packet loss = 100% [18:02:25] ^ hafnium is me, forgot to mark downtime in icinga [18:02:42] RECOVERY - Host hafnium is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms [18:04:24] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:09:20] 6operations: Mail alias needed vpe-staff to route to eng-mgt - https://phabricator.wikimedia.org/T105431#1976770 (10Dereckson) [18:12:06] (03PS2) 10Yurik: Update graph settings - should be noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267060 [18:13:38] 6operations, 7HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1977465 (10RobH) It seems this was the pushback date for the OTRS upgrade, but it wasnt on the deployments calendar so I wasnt aware. [18:16:02] 6operations, 7HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1977854 (10RobH) (Though it was already on the otrs landing page so shame on me for missing it.) Since this ssl expiry isn't until the 16th, I'll simply wait until after the... [18:16:21] PROBLEM - Hadoop ResourceManager on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [18:16:50] ottomata: ^ you looking into that or? 
[18:16:56] it's me sorry [18:17:09] cool [18:17:09] didn't silence icinga [18:17:16] 6operations, 10Analytics: Requests to (hard) redirect pages return their target's contents but are counted as pageviews to the redirect page - https://phabricator.wikimedia.org/T125015#1978088 (10Milimetric) p:5Triage>3Normal [18:17:18] sorry today is a bad day [18:17:23] apologies [18:17:24] I just wanted to make sure it wasn't unexpected, no worries [18:18:03] yeah theoretically we are using failover and it shouldn't have happened [18:18:17] 6operations, 10Analytics-Cluster, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1978243 (10Milimetric) [18:18:19] maybe it will recover in a bit, 1002 should be the active now [18:19:12] ah no zero processes because we stopped, ok. Silencing icinga [18:21:01] !log deployed graphoid [18:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:21:44] the new catwatch stuff [18:21:52] opt out seems broken [18:22:07] !log rebooting analytics1001 for new kernel upgrade [18:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:48] 6operations, 10Graphoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#1978828 (10mobrovac) [18:23:19] 6operations, 6Services, 3Mobile-Content-Service, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#1978835 (10mobrovac) [18:23:40] (03CR) 10Bmansurov: Add sampling rates for mobile web language switcher on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [18:24:54] 6operations, 7Monitoring, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1978849 (10jcrespo) [18:25:31] RECOVERY - Hadoop ResourceManager on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [18:26:07] 6operations, 7Monitoring, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1978862 (10jcrespo) Making this a blocker for codfw rollout due to hard dependency from MySQL HP servers on dallas. [18:29:54] 6operations, 5WMF-deploy-2016-01-19_(1.27.0-wmf.11): Rise in "parent, LightProcess exiting" fatals on mw1019 since 1.27.0-wmf.11 deploy - https://phabricator.wikimedia.org/T124956#1978929 (10Tgr) Does not coincide with the deploy: https://logstash.wikimedia.org/#dashboard/temp/AVKJdYMNptxhN1XaDFUZ https://logs...
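The exchange above concerns ResourceManager HA on analytics1001/analytics1002: before rebooting the active master you confirm leadership has actually moved to the peer. A sketch using the stock `yarn rmadmin` tool; the HA ids ('rm1'/'rm2') and their host mapping are assumptions that live in yarn-site.xml:

```python
#!/usr/bin/env python
"""Check which Hadoop ResourceManager is active before a reboot (sketch).

The rm ids and the analytics100x mapping are assumptions; read them
from yarn.resourcemanager.ha.rm-ids in yarn-site.xml.
"""
import subprocess

for rm_id, host in [('rm1', 'analytics1001'), ('rm2', 'analytics1002')]:
    state = subprocess.check_output(
        ['yarn', 'rmadmin', '-getServiceState', rm_id]).strip()
    print('%s (%s): %s' % (rm_id, host, state))
# Reboot a master only once it reports 'standby' and its peer 'active';
# otherwise the PROCS alert fires exactly as it did above.
```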
[18:44:32] (03PS3) 10Dzahn: Remove absented uuid-generator script [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris) [18:45:44] (03PS4) 10Dzahn: Remove absented uuid-generator script [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris) [18:45:53] (03CR) 10Dzahn: [C: 032] Remove absented uuid-generator script [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris) [18:46:30] (03CR) 10Dzahn: "removed dependency, rebased, ready to merge, +2 ,leaving the submit to you" [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris) [18:49:34] (03CR) 10Dzahn: "..or i can do it, wasn't sure which way you prefer" [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris) [18:51:13] (03CR) 10Dzahn: "hrmm. yea, i'm afraid so, unless we can get Mark to ack it as an addendum to the access request for carbon from last meeting" [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [18:53:50] (03CR) 10Mobrovac: [C: 031] cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack) [18:56:54] tgr, ori, bd808: can the train move forward today without a fix to stats update? https://phabricator.wikimedia.org/T125054 [18:57:03] (03PS7) 10Eevans: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) [18:57:10] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1979046 (10MaxSem) 3NEW [18:57:21] (03PS5) 10Dzahn: statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi) [18:57:51] (03PS6) 10Dzahn: statsite: default to localhost, override as needed [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi) [18:58:43] (03CR) 10Dzahn: "fixed path conflict, rebased manually" [puppet] - 10https://gerrit.wikimedia.org/r/204275 (owner: 10Filippo Giunchedi) [18:59:20] (03CR) 10Mobrovac: [C: 031] "If738e7401a60e47ca45b875ed3abefa40dcebe11 showed that life is good in staging, so let's extend the pleasures to prod." [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [19:00:04] marxarelli: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160128T1900). [19:01:10] (03CR) 10Bmansurov: Add sampling rates for mobile web language switcher on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [19:02:20] (03CR) 10Dzahn: "@AndyRussG are you still interested in this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG) [19:02:23] (03PS4) 10Bmansurov: Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) [19:05:13] (03CR) 10Dzahn: "@Coren i'm gonna be bold and abandon this, it's been a couple months and the linked ticket is closed as resolved" [puppet] - 10https://gerrit.wikimedia.org/r/240684 (https://phabricator.wikimedia.org/T113298) (owner: 10coren) [19:05:21] (03Abandoned) 10Dzahn: Fix ssh public key for junikowski [puppet] - 10https://gerrit.wikimedia.org/r/240684 (https://phabricator.wikimedia.org/T113298) (owner: 10coren) [19:06:36] (03PS3) 10Dzahn: ganglia diskstat.py: pep8 fixes all over the place [puppet] - 10https://gerrit.wikimedia.org/r/264997 (owner: 10Chad) [19:06:39] marxarelli: I don't think that one graph should be a blocker, no [19:07:00] bd808: rgr that. anything else i need to be aware of atm? [19:07:24] didn't see anything else on the -ops ml in in phab [19:07:34] or fatalmonitor for that matter [19:07:46] (03CR) 10AndyRussG: "Hi! Sorry for the delay. Yes, it would be important to have some comments in here. I'll get back to it a bit later today... Thanks for the" [puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG) [19:08:36] anomie: are either of your pending patches blockers for wikipedia rollout today? [19:08:56] * bd808 hasn't had time to review and continues to munch sandwich [19:09:17] bd808: Not necessarily blockers, but https://gerrit.wikimedia.org/r/#/c/267066/ would be good to have. [19:10:17] (03CR) 10Dzahn: [C: 032] "looks good, i'll keep an eye on it nevertheless" [puppet] - 10https://gerrit.wikimedia.org/r/264997 (owner: 10Chad) [19:10:56] (03PS2) 10Dzahn: ganglia compat.py: couple of pep8 fixes, mostly whitespace [puppet] - 10https://gerrit.wikimedia.org/r/265100 (owner: 10Chad) [19:12:21] (03CR) 10Dzahn: [C: 032] ganglia compat.py: couple of pep8 fixes, mostly whitespace [puppet] - 10https://gerrit.wikimedia.org/r/265100 (owner: 10Chad) [19:14:45] anomie: if you can get that +2'd soonish, i'll merge/sync a backport to wmf.11 [19:15:09] tgr: ^ [19:18:57] 6operations, 7Monitoring, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1979159 (10jcrespo) AFAIK, hpcacucli is non-free. This is the basic, free, debian-included option to do that: ``` $ cciss_vol_status --verbose /dev/sg0 Cont... 
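Following jcrespo's cciss_vol_status suggestion in T97998 above, a RAID check for the HP servers could be as small as a wrapper that inspects the controller output. A minimal NRPE-style sketch; /dev/sg0 is taken from the task comment, and multi-controller boxes would enumerate /dev/sg* instead:

```python
#!/usr/bin/env python
"""Minimal Nagios/NRPE-style wrapper around cciss_vol_status (sketch).

Assumes a single HP controller at /dev/sg0, per the task comment above.
Exit codes follow the Nagios convention: 0=OK, 2=CRITICAL, 3=UNKNOWN.
"""
import subprocess
import sys

try:
    out = subprocess.check_output(['cciss_vol_status', '/dev/sg0'])
except OSError:
    print('UNKNOWN: cciss_vol_status not installed')
    sys.exit(3)
except subprocess.CalledProcessError as e:
    # The tool exits non-zero when a volume is degraded or failed.
    print('CRITICAL: cciss_vol_status exited %d' % e.returncode)
    sys.exit(2)

if 'status: OK' in out:
    print('OK: %s' % out.strip())
    sys.exit(0)
print('CRITICAL: %s' % out.strip())
sys.exit(2)
```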
[19:20:31] (03PS2) 10Dzahn: ganglia nginx_status.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265101 (owner: 10Chad) [19:22:08] (03CR) 10Dzahn: [C: 032] ganglia nginx_status.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265101 (owner: 10Chad) [19:23:53] (03PS2) 10Dzahn: ganglia: util.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265102 (owner: 10Chad) [19:24:15] (03CR) 10Dzahn: [C: 032] "almost all of it just changes comment lines" [puppet] - 10https://gerrit.wikimedia.org/r/265102 (owner: 10Chad) [19:25:10] (03PS1) 10Bmansurov: Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) [19:26:31] !log kafka preferred-replica-election to rebalanace analytics-eqiad brokers [19:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:44] (03PS2) 10Dzahn: ganglia: udp2log_socket.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265103 (owner: 10Chad) [19:28:23] marxarelli: hi, can I ask you a favor? I need to get 2 patches swat deployed. [19:28:35] marxarelli: one is for beta cluster, the other is for production [19:28:42] marxarelli: will you be able to help? [19:28:45] (03PS3) 10Yuvipanda: Add jessie + jessie backports image builder [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267009 [19:28:57] bmansurov: not right now, we're trying to diagnose a potential rollback worthy issue [19:29:01] bmansurov: sure. the train is held up atm anyway [19:29:11] heh, two takes on the same situation [19:29:37] marxarelli, greg-g: OK, thanks. When should I come back approximately? [19:29:53] (03CR) 10Dzahn: [C: 032] ganglia: udp2log_socket.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265103 (owner: 10Chad) [19:30:49] bmansurov: unclear right now, how about this afternoon's swat? [19:30:51] (03PS2) 10Dzahn: ganglia: udp_stats.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265104 (owner: 10Chad) [19:31:08] greg-g: sounds good, I'll add my patches to the wiki page. thanks [19:31:22] bmansurov: in general, that's what you should do in this situation, add it to the swat window [19:31:35] greg-g: ok [19:31:50] that's what swat is, it's not "any old time" ;) [19:32:09] got it [19:32:17] bmansurov: sorry for the confusion. i'm just getting the hang of deploys :) [19:32:40] all good [19:35:25] marxarelli: I should add my patches to "MediaWiki train" right? 
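The preferred-replica-election !log above is what moves Kafka partition leadership back to the preferred brokers after lag alerts like the earlier kafka1013/kafka1018 ones, i.e. it rebalances leadership across the analytics-eqiad cluster. A sketch of the invocation using the stock tool that ships with Kafka of this era; the ZooKeeper connect string and chroot are assumptions:

```python
#!/usr/bin/env python
"""Trigger a Kafka preferred-replica election (sketch).

The ZooKeeper address and /kafka chroot are assumptions; check the
broker's zookeeper.connect setting for the real value.
"""
import subprocess

subprocess.check_call([
    'kafka-preferred-replica-election.sh',
    '--zookeeper', 'conf1001.eqiad.wmnet:2181/kafka/analytics-eqiad',
])
# Without --path-to-json-file the tool elects the preferred leader for
# every partition; pass a JSON topic/partition list to scope it down.
```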
[19:35:25] it has different background color ;) [19:35:25] bmansurov: no, swat [19:35:25] bmansurov: the train deploy is for branch cuts which will include anything in the master branch [19:35:25] you want swat [19:35:25] *weekly* branch cuts [19:35:26] oh, the evening SWAT is at the top ;) [19:35:26] https://wikitech.wikimedia.org/wiki/SWAT_deploys [19:36:14] thansk [19:36:17] * [19:43:08] !log added tgr and marxarelli to security group on phab [19:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:56] (03PS1) 10Ori.livneh: Revert "Autopromotion: remove deprecated onView event, fix INGROUPS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267072 [19:49:07] (03PS2) 10Ori.livneh: Revert "Autopromotion: remove deprecated onView event, fix INGROUPS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267072 [19:50:00] (03CR) 10Ori.livneh: [C: 032] Revert "Autopromotion: remove deprecated onView event, fix INGROUPS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267072 (owner: 10Ori.livneh) [19:50:11] interesting [19:50:24] (03Merged) 10jenkins-bot: Revert "Autopromotion: remove deprecated onView event, fix INGROUPS" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267072 (owner: 10Ori.livneh) [19:50:39] no idea what in that change would cause such issues [19:51:17] I guess we're about to find out whether it does actually cause any issues though [19:53:03] (03CR) 10Yuvipanda: [C: 032 V: 032] Add jessie + jessie backports image builder [docker-images/debian] - 10https://gerrit.wikimedia.org/r/267009 (owner: 10Yuvipanda) [19:53:43] (03CR) 10Dzahn: [C: 032] ganglia: udp_stats.py: bunch of pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265104 (owner: 10Chad) [19:54:30] (03PS4) 10BBlack: role::cache: install conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/263620 (owner: 10Giuseppe Lavagetto) [19:55:18] !log ori@mira Synchronized wmf-config: Iea2573ccfbe: Revert "Autopromotion: remove deprecated onView event, fix INGROUPS" (duration: 02m 13s) [19:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:17] (03CR) 10BBlack: [C: 032] role::cache: install conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/263620 (owner: 10Giuseppe Lavagetto) [19:58:45] 6operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#1979267 (10Dzahn) You can help me with this by merging/checking something from this list: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet... 
[20:00:51] !log depool -> reboot cp4011 (ulsfo mobile, currently unused for traffic - testing local conftool-scripts depool + new kernel) [20:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:03:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:04:58] (03PS2) 10Dzahn: racktables: fix top-scope variable without explicit namespace [puppet] - 10https://gerrit.wikimedia.org/r/266962 [20:05:27] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp4011_v4, cp4011_v6 [20:05:53] (03CR) 10Dzahn: [C: 032] racktables: fix top-scope variable without explicit namespace [puppet] - 10https://gerrit.wikimedia.org/r/266962 (owner: 10Dzahn) [20:06:00] cp4011-related ipsec criticals to be expected, sorry [20:07:16] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 166 ESP OK [20:07:39] (03CR) 10Jdlrobson: "Wouldn't it make more sense to set wgMFIgnoreEventLoggingBucketing = false so we can test all schemas?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [20:08:35] (03CR) 10Bmansurov: "It depends on whether we want to test other schemas. Do we?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [20:09:25] (03PS2) 10Dzahn: lists: fix top-scope var, arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/266983 [20:09:39] (03CR) 10Dzahn: [C: 032] lists: fix top-scope var, arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/266983 (owner: 10Dzahn) [20:10:16] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:10:37] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:12:14] find . -maxdepth 4 -name .git -type d -print -execdir bash -c 'git remote rename gerrit origin || : ; git-review -s || :;' [20:12:19] wrong term [20:13:44] 6operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#1979327 (10Dzahn) https://gerrit.wikimedia.org/r/266962 https://gerrit.wikimedia.org/r/266963 https://gerrit.wikimedia.org/r/266964 https://gerrit.wikimedia.org/r/... [20:23:21] (03CR) 10Phuedx: [C: 04-1] "+1 to the approach – only enable what we want to test." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [20:23:26] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1979356 (10jcrespo) This has accelerated in the last week, it may happen the 20 of February, scheduling a master failover soon. 
[20:25:35] !log depool -> reboot cp4008 (ulsfo text, trying new kernel with live traffic) [20:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:52] (03PS2) 10Dzahn: monitoring: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266963 [20:27:10] (03CR) 10Dzahn: [C: 032] monitoring: fix top-scope vars without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266963 (owner: 10Dzahn) [20:30:07] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:30:28] (03PS5) 10Bmansurov: Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) [20:32:03] 6operations, 10Analytics-Cluster: analytics1026 - '/bin/mount /mnt/hdfs' returned 1: fuse: - https://phabricator.wikimedia.org/T125009#1979374 (10Ottomata) 5Open>3Resolved a:3Ottomata Weird! I umounted /mnt/hdfs, ran puppet twice, and now everything is fine. [20:32:35] 6operations, 10Citoid, 10Graphoid, 10Mathoid, and 2 others: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979380 (10Dzahn) @mobrovac looks done meanwhile? ``` Linux sca1001 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 Ubuntu 14.04.3... [20:34:04] 6operations, 10Citoid, 10Graphoid, 10Mathoid, and 2 others: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979412 (10Dzahn) ah, looks like they are still running on sca1002. want me to kill those process and clean up? [20:35:42] 6operations, 10Analytics-Cluster, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (10Ottomata) [20:36:04] 6operations, 7Swift: File not found after reupload - https://phabricator.wikimedia.org/T125140#1979421 (10matmarex) [20:37:36] 6operations, 10Analytics-Cluster, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (10Ottomata) [20:37:38] 6operations, 10Analytics, 5Patch-For-Review: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#1979429 (10Ottomata) [20:49:34] ori: seeing a lot of "Undefined variable: wmgAutopromoteOnceonView in /srv/mediawiki/wmf-config/CommonSettings.php on line 1042" [20:49:40] is that related to your config sync? [20:50:06] !log dduvall@mira Synchronized php-1.27.0-wmf.11: syncing 1.27.0-wmf.11 for T125114 and https://gerrit.wikimedia.org/r/#/c/267128/ (duration: 03m 30s) [20:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:16] marxarelli: was there a brief spike that went away (which would indicate the standard race condition -- there were two files) , or are they still ongoing? [20:50:43] 6operations, 10Citoid, 10Graphoid, 10Mathoid, and 2 others: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979477 (10mobrovac) >>! In T125029#1979412, @Dzahn wrote: > ah, looks like they are still running on sca1002. want me to kill those process and clean up?... [20:50:48] ori: 153305 of them in fatalmonitor but stable [20:50:56] (03PS2) 10RobH: add jforrester to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/266428 [20:51:06] marxarelli: stable as in they've stopped? 
[20:51:11] ori: yep [20:51:15] yeah, race cond [20:52:04] !log sca1002 - stop mathoid,graphoid,citoid [20:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:52] ori: got it, ty [20:54:01] (03CR) 10RobH: [C: 032] add jforrester to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/266428 (owner: 10RobH) [20:54:23] !log sca1001 - stop mathoid,graphoid,citoid [20:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:45] 6operations, 10Citoid, 10Graphoid, 10Mathoid, and 2 others: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979497 (10Dzahn) stopped all 3 services on both machines, ran puppet, confirmed they did not get restarted anymore [20:55:51] (03CR) 10jenkins-bot: [V: 04-1] add jforrester to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/266428 (owner: 10RobH) [20:57:01] (03CR) 10Gergő Tisza: "Seems like the revert had no effect on the spike." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267072 (owner: 10Ori.livneh) [20:57:06] robh: Puppet doesn't like me. [20:57:18] yea it was a rebase of a file without changes though [20:57:31] so i'm not convinced something isn't borked, i'm going to go reference the log file [20:57:55] * James_F nods. [20:58:42] oops, that one looks like a jenkins issue [20:58:46] 20:54:55 ERROR: could not install deps [nose, -rmodules/admin/data/requirements.txt]; v = InvocationError('/home/jenkins/workspace/operations-puppet-tox-py27-jessie/.tox/py27/bin/pip install nose [20:59:05] unusual [20:59:19] yep [20:59:26] hashar: ;) hi [20:59:29] https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27-jessie/428/console [21:00:06] i suppose it technically didn't happen unless it happens twice. i'll rebase again and see if it misfires a second time. [21:00:30] i saw something about changes to that tox.ini that is mentioned .. [21:00:32] mutante: hello [21:00:34] there isn't an easier way to retrigger is there? [21:00:40] oh, i'll wait cuz hashar is here =] [21:00:47] hashar: should we report that error above as a new bug? [21:00:58] "could not install deps" [21:00:58] well [21:01:00] hashar: https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27-jessie/428/console has an error in the output that appears jenkins specific [21:01:01] from the console output [21:01:08] 'Connection to pypi.python.org timed out. (timeout=15)' [21:01:15] ooh [21:01:18] so seems pypi.python.org had some issue [21:01:21] yea, a ton of them [21:01:22] interesting [21:01:23] and hopefully we are not blacklisted [21:01:42] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1979513 (10BBlack) Are the 4x tile servers in codfw "maps-test200x"? Are those being renamed / reused as production? [21:01:46] want anyone to make a phab task for ya? (I can if you do) [21:02:01] or who would investigate if we are blacklisted? [21:02:13] just 'recheck' [21:02:17] it is probably transient [21:02:27] hopefully [21:02:47] what's the best way to retrigger the testing? just change and recommit? [21:03:02] (i didn't know if there is a better way via the gerrit interface that i'm missing) [21:03:32] 6operations, 10Citoid, 10Graphoid, 10Mathoid, and 2 others: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979522 (10Dzahn) the issue we saw on icinga is also gone. done?
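ori's diagnosis a little earlier (a burst of "Undefined variable: wmgAutopromoteOnceonView" fatals that stops = the standard race where two config files sync non-atomically; a count that keeps climbing = a real breakage) can be checked mechanically by bucketing matching fatals per minute. A sketch; the log path argument and the [HH:MM:SS] timestamp shape are assumptions modelled on fatalmonitor-style output:

```python
#!/usr/bin/env python
"""Bucket matching fatals per minute to tell a transient sync race from
an ongoing breakage (sketch).

The log format is an assumption: any line containing the pattern with a
[HH:MM:SS] timestamp somewhere in it.
"""
import collections
import re
import sys

PATTERN = 'Undefined variable: wmgAutopromoteOnceonView'
buckets = collections.Counter()

for line in open(sys.argv[1]):
    if PATTERN not in line:
        continue
    m = re.search(r'(\d\d:\d\d):\d\d', line)
    if m:
        buckets[m.group(1)] += 1

for minute in sorted(buckets):
    print('%s %6d' % (minute, buckets[minute]))
# A short burst that stops is the two-file sync race; a count that keeps
# growing is a regression worth reverting.
```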
[21:03:48] robh: you can comment in gerrit "recheck" (without quotes) [21:04:03] cool, thx! [21:04:12] the comment being added generate an event that is consumed by Zuul [21:04:16] (03CR) 10RobH: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/266428 (owner: 10RobH) [21:04:48] and whenever a comment == 'recheck', Zuul enter the related Gerrit change in 'test' exactly as if a new patchset had been produced [21:05:15] then as usual, it will shows up at https://integration.wikimedia.org/zuul/ [21:05:34] oh, do i need to remove the gerrit review already on it/ [21:05:47] oh [21:05:54] likely i do before i recheck right? [21:05:56] robh: can you comment again [21:05:58] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1979529 (10jcrespo) Doing s2 next week due to T122048 [21:06:02] 6operations, 10Citoid, 10Graphoid, 10Mathoid, and 2 others: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979531 (10mobrovac) 5Open>3Resolved a:3mobrovac Confirmed it's all good, so resolving. Thnx @Dzahn! [21:06:17] it doesn't accept 'recheck' if you remove a vote at the same time (the regex is stricter.. ) [21:06:17] hashar: my first one didn't seem to fire the event that i could see [21:06:21] 6operations, 10Citoid, 10Graphoid, 10Mathoid, 6Services: Remove Graphoid, Mathoid and Citoid from sca100x - https://phabricator.wikimedia.org/T125029#1979534 (10mobrovac) [21:06:34] "Patch Set 2: -Code-Review\nrecheck" .. it is not matched [21:06:37] :( [21:06:50] comment again but dont touch the current verified-1 from gerrit right? [21:07:03] (03PS1) 10Dzahn: Revert "monitoring: fix top-scope vars without namespace" [puppet] - 10https://gerrit.wikimedia.org/r/267133 [21:07:26] yeah [21:07:47] (03CR) 10Dzahn: "this results in a change in servicegroups for some servics, but only codfw.. looking.. maybe revert" [puppet] - 10https://gerrit.wikimedia.org/r/266963 (owner: 10Dzahn) [21:08:20] (03CR) 10RobH: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/266428 (owner: 10RobH) [21:08:22] (03PS1) 10Anomie: Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) [21:08:33] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1979542 (10jcrespo) a:3jcrespo We will switchover from db1024 -> db1018 [21:08:57] marxarelli: so, post that last sync, how are we? cc anomie tgr [21:09:22] if it fixed T125114, then we're good to go on full rollout, right? (per https://phabricator.wikimedia.org/T125143 ) [21:09:23] looks much better this time [21:09:26] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1979544 (10jcrespo) p:5Normal>3High [21:09:31] greg-g: I'm still working on T125133. At the moment it looks like there's at least two things wrong, one of which is https://gerrit.wikimedia.org/r/267134 [21:09:51] hashar: proof again if an error only happens once it didnt happen =] [21:10:06] anomie: ok, just to double check, agreement was that 125133 wasn't a blocker, right? 
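Besides commenting through the web UI as robh does above, the same bare 'recheck' can be posted through Gerrit's REST review endpoint. A sketch; the change number and credentials are placeholders, and per hashar's note the message must be exactly 'recheck' with no vote change attached or Zuul's regex will not match:

```python
#!/usr/bin/env python
"""Post a bare 'recheck' comment via Gerrit's REST API (sketch).

Change number and credentials are placeholders; the body carries only
the message so Zuul's stricter recheck regex still matches.
"""
import json
import urllib2

GERRIT = 'https://gerrit.wikimedia.org/r'
CHANGE = '266428'  # hypothetical example

mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, GERRIT, 'myuser', 'my-http-password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))

req = urllib2.Request(
    '%s/a/changes/%s/revisions/current/review' % (GERRIT, CHANGE),
    json.dumps({'message': 'recheck'}),
    {'Content-Type': 'application/json'})
opener.open(req)
```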
[21:10:06] (unless you were fixing things on the backend of course) [21:10:10] robh: yeah transient error of doom [21:10:14] greg-g: i can verify that login works [21:10:19] (03CR) 10RobH: [C: 032] add jforrester to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/266428 (owner: 10RobH) [21:10:27] s/agreement/concensus/ [21:10:28] greg-g: Last I heard, correct. I haven't been paying a lot of attention though. [21:10:31] robh: maybe one day we will have our npm/gem/pip mirrors ;D [21:10:34] because I've been trying to fix it ;)_ [21:10:37] anomie: k, thank you [21:10:43] anomie: :) appreciated [21:10:52] (03CR) 10Gergő Tisza: [C: 031] Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie) [21:11:08] +1, anomie rocks. [21:11:20] marxarelli: then let's get this damned train to the final station, eh? [21:11:39] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1979545 (10BBlack) Status for `parsoid-lb.eqiad.wikimedia.org`: only a small handful of requests are still flowi... [21:11:47] greg-g: w00t w00t [21:12:07] I'll be so happy to not think about 1.27-wmf.11 again :) [21:12:43] i will always have fond memories [21:12:45] next version 1.27-wmf.11b ? [21:12:53] s/memories/trauma/ [21:12:56] * greg-g kicks mutante [21:13:42] * ori logs out greg-g [21:14:19] (03PS1) 10Dduvall: all wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267138 [21:15:02] (03CR) 10Dduvall: [C: 032] all wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267138 (owner: 10Dduvall) [21:15:11] 10Ops-Access-Requests, 6operations: Grant James F. (`jforrester`) access to Hive (`analytics-privatedata-users`) - https://phabricator.wikimedia.org/T124719#1979568 (10RobH) 5stalled>3Resolved a:3RobH No objections were noted and this access is now live. [21:15:14] ori: good for me I'm automatically logged back in [21:15:20] James_F: you are all set [21:15:22] heh [21:15:25] (too soon?) [21:15:48] robh: Thanks! 
[21:15:49] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267138 (owner: 10Dduvall) [21:16:04] !log dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.11 [21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:56] (03PS1) 10BBlack: lvs: convert lvs1004-12 to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267141 [21:19:58] (03PS1) 10BBlack: lvs: convert all eqiad lvs to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267142 [21:20:58] (03CR) 10Dzahn: [C: 032] "it changed "mysql_codfw" and "openldap_corp_mirror_codfw" service groups to "misc_codfw"" [puppet] - 10https://gerrit.wikimedia.org/r/267133 (owner: 10Dzahn) [21:21:21] (03PS2) 10Dzahn: Revert "monitoring: fix top-scope vars without namespace" [puppet] - 10https://gerrit.wikimedia.org/r/267133 [21:22:04] alright, with the train only being 16 minutes late, I'm going to head out to a coffee shop (I seem to get my best annual plan work done there), bbiaf [21:26:15] tgr, anomie: seeing a new fatal "Destructor threw an object exception: exception 'BadMethodCallException' with message 'Call to a member function getTitle() on a non-object (NULL)'" [21:26:43] backtrace shows it's from sessionmanager [21:27:20] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#1979609 (10ori) [21:27:20] (03PS2) 10BBlack: lvs: convert lvs1004-12 to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267141 [21:27:32] (03CR) 10BBlack: [C: 032 V: 032] lvs: convert lvs1004-12 to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267141 (owner: 10BBlack) [21:27:34] !log converting backup/inactive eqiad LVS/pybal to etcd [21:27:35] :/ [21:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:44] marxarelli: task it up? [21:27:58] greg-g: kk [21:31:19] marxarelli: if you're busy I can (I just got to the stack trace) [21:31:48] !log ori@mira Synchronized php-1.27.0-wmf.11/extensions/AbuseFilter: I13fcc3ce4: Updated mediawiki/core Project: mediawiki/extensions/AbuseFilter 19baa3b6e51b8fe6baf6e3ce7e590060e8e6eec9 (duration: 01m 11s) [21:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:35:06] (03PS1) 10Dzahn: releases: make Apache config work with 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/267148 (https://phabricator.wikimedia.org/T124261) [21:36:01] (03PS1) 10Chad: pep8: don't use a lambda in check_legal_html.py [puppet] - 10https://gerrit.wikimedia.org/r/267149 [21:36:56] (03PS2) 10Dzahn: releases: make Apache config work with 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/267148 (https://phabricator.wikimedia.org/T124261) [21:38:15] (03CR) 10Jdlrobson: [C: 031] "Did you find someone to SWAT this @bmansurov ?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [21:39:11] (03CR) 10Dzahn: [C: 032] releases: make Apache config work with 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/267148 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [21:47:30] (03PS1) 10Dzahn: varnish/misc-web: switch releases to bromine backend [puppet] - 10https://gerrit.wikimedia.org/r/267151 (https://phabricator.wikimedia.org/T124261) [21:47:59] (03PS2) 10Dzahn: varnish/misc-web: switch releases to bromine backend [puppet] - 10https://gerrit.wikimedia.org/r/267151 (https://phabricator.wikimedia.org/T124261) [21:48:29] (03CR) 10Dzahn: [C: 032] varnish/misc-web: switch releases to bromine backend [puppet] - 10https://gerrit.wikimedia.org/r/267151 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [21:55:59] !log caesium - stopped apache [21:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:11] PROBLEM - HTTP on caesium is CRITICAL: Connection refused [21:56:41] everybody still sees releases.wikimedia.org as normal, right [21:57:06] assuming it's just a directory index of stuff, yes :) [21:57:11] just switched it over, and lgtm was just a bit surprised to not see access in logs [21:57:14] yes, it is [21:57:48] ok, one more Ubuntu can be killed [21:58:10] (03PS2) 10BBlack: lvs: convert all eqiad lvs to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267142 [21:58:24] (03CR) 10BBlack: [C: 032] lvs: convert all eqiad lvs to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267142 (owner: 10BBlack) [21:58:31] (03CR) 10BBlack: [V: 032] lvs: convert all eqiad lvs to etcd [puppet] - 10https://gerrit.wikimedia.org/r/267142 (owner: 10BBlack) [21:58:34] maybe we can add some design to that directory index [21:58:42] just using Apache config i mean :p [21:58:42] !log converting active eqiad LVS/pybal to etcd [21:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:59:15] !log releases.wm.org - switched backend to bromine [21:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:09] !log dduvall@mira Synchronized php-1.27.0-wmf.11/extensions/WikimediaEvents/WikimediaEventsHooks.php: deploying fix for T125151 (duration: 01m 15s) [22:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:12] !log eqiad pybal->etcd conversion done [22:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:19:00] (03PS1) 10BBlack: eqiad: add text nodes to cache_mobile frontends [puppet] - 10https://gerrit.wikimedia.org/r/267159 (https://phabricator.wikimedia.org/T109286) [22:19:02] (03PS1) 10BBlack: eqiad: remove mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) [22:20:15] (03CR) 10BBlack: [C: 032] eqiad: add text nodes to cache_mobile frontends [puppet] - 10https://gerrit.wikimedia.org/r/267159 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [22:23:17] !log starting cache_mobile->cache_text conversion in eqiad - https://phabricator.wikimedia.org/T109286 [22:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:43] RECOVERY - HTTP on caesium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.006 second response time [22:29:25] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 75.00% of data 
above the critical threshold [24.0] [22:29:57] (03PS1) 10Hashar: contint: php5-ldap on all slaves [puppet] - 10https://gerrit.wikimedia.org/r/267165 (https://phabricator.wikimedia.org/T125158) [22:30:03] PROBLEM - HTTP on caesium is CRITICAL: Connection refused [22:33:13] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [22:33:47] (03CR) 10Andrew Bogott: [C: 032] contint: php5-ldap on all slaves [puppet] - 10https://gerrit.wikimedia.org/r/267165 (https://phabricator.wikimedia.org/T125158) (owner: 10Hashar) [22:44:06] (03PS1) 10MtDu: Change some Nepali Wikibooks configurations * sitename * logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) [22:46:51] So, Krenair, I broke logging for you somehow? Or your ability to read lots? [22:47:16] not exactly [22:47:40] but go to silver and ls -l /var/log/apache2 [22:47:45] It’s just because I added daily rotation, right? [22:48:01] at some point, it was error.log that was group=deployer [22:48:16] oh yeah [22:48:23] I don't remember why it had to be there [22:48:23] I wonder why that changed? logrotate shouldn’t do that [22:48:40] I think there was some reason the log wasn't going to fluorine? [22:48:48] or maybe I just couldn't find it on fluorine but that contained it [22:48:51] or something [22:49:39] * andrewbogott looks on fluorine [22:51:13] I should probably do some OSM patch review... [22:51:30] I don’t think I even know where error.log goes on fluorine [22:51:46] but I also don’t understand why those logs on silver aren’t written as www-data since that’s the user apache runs as [22:52:14] maybe they go in apache2.log? [22:52:53] RECOVERY - HTTP on caesium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.011 second response time [22:53:01] yeah, looks like [22:53:10] well, they would, if silvers logs went there at all [22:54:31] I think it does send logs there [22:57:44] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [24.0] [22:58:23] !log restoring MobileWebSectionUsage_14321266 from db1047 to dbstore1002 using mysqlimport [22:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:59:10] jynus: fyi ^^ [23:01:30] (03PS1) 10Dzahn: releases: beautify directory index page [puppet] - 10https://gerrit.wikimedia.org/r/267175 [23:08:22] (03PS2) 10Dzahn: releases: beautify directory index page [puppet] - 10https://gerrit.wikimedia.org/r/267175 [23:08:25] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:09:49] (03PS3) 10Dzahn: releases: beautify directory index page [puppet] - 10https://gerrit.wikimedia.org/r/267175 (https://phabricator.wikimedia.org/T124261) [23:12:02] 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980139 (10Dzahn) 3NEW a:3Dzahn [23:12:23] 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980139 (10Dzahn) releases: beautify directory index page Currently https://releases.wikimedia.org is just a default directory index that isn't very pretty, make it look a bit nicer by:... 
[23:12:35] (03PS4) 10Dzahn: releases: beautify directory index page [puppet] - 10https://gerrit.wikimedia.org/r/267175 (https://phabricator.wikimedia.org/T125164) [23:13:20] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1980157 (10MaxSem) For the reference, upstream fix: https://github.com/facebook/hhvm/commit/88e9ca810d1af78b63cf1668841fa38b2b0a01ba [23:14:17] 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980162 (10Dzahn) before: {F3289385} [23:15:22] (03PS5) 10Dzahn: releases: beautify directory index page [puppet] - 10https://gerrit.wikimedia.org/r/267175 (https://phabricator.wikimedia.org/T125164) [23:15:32] (03CR) 10Dzahn: [C: 032] releases: beautify directory index page [puppet] - 10https://gerrit.wikimedia.org/r/267175 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn) [23:16:12] (03CR) 10MarcoAurelio: [C: 04-1] Change some Nepali Wikibooks configurations * sitename * logo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [23:16:33] 6operations, 6Performance-Team, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1980166 (10MaxSem) [23:16:35] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1980165 (10MaxSem) [23:17:27] uhm [23:17:34] are the heaps LightProcess errors during scap normal? [23:17:52] tgr: they have been on mira :/ [23:17:57] !log tgr@mira Synchronized php-1.27.0-wmf.11/includes/session/SessionManager.php: T125161 (duration: 01m 11s) [23:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:17] we could probably disable lightprocess for cli [23:18:26] it's just noise, but it's alarming noise [23:20:04] (03PS1) 10Dzahn: releases: fix path to custom html header [puppet] - 10https://gerrit.wikimedia.org/r/267184 (https://phabricator.wikimedia.org/T125164) [23:20:34] !log ori@mira Synchronized php-1.27.0-wmf.11/includes/api/ApiStashEdit.php: Ia4196eba9: Add ParserOutputStashForEdit hook for extension cache warming (duration: 01m 10s) [23:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:57] (03PS2) 10Dzahn: releases: fix path to custom html header [puppet] - 10https://gerrit.wikimedia.org/r/267184 (https://phabricator.wikimedia.org/T125164) [23:24:22] (03CR) 10Dzahn: [C: 032] releases: fix path to custom html header [puppet] - 10https://gerrit.wikimedia.org/r/267184 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn) [23:25:21] (03PS1) 10Dereckson: Santiago Editatón throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T124284) [23:27:31] 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980189 (10Dzahn) now: {F3289421} {F3289423} [23:31:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 73.08% of data above the critical threshold [5000000.0] [23:32:30] (03PS2) 10Dereckson: Santiago Editatón throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) [23:33:06] 6operations, 5Patch-For-Review: move releases.wm.org to bromine (was: 
request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1980212 (10Dzahn) The backend for this service has been switched over to bromine now. In addition to subscriptions on this ticket I will send a mail to all "release... [23:33:40] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1980216 (10Dzahn) [23:33:48] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1980217 (10Dzahn) 5Open>3Resolved [23:33:50] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1980219 (10Dzahn) [23:36:31] 6operations: decom caesium - https://phabricator.wikimedia.org/T125165#1980223 (10Dzahn) 3NEW [23:36:31] 6operations: decom caesium - https://phabricator.wikimedia.org/T125165#1980223 (10Dzahn) a:3Dzahn [23:36:31] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (10Dzahn) [23:37:06] (03CR) 10Jdlrobson: Add sampling rates for mobile web language switcher in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [23:37:34] (03PS3) 10Dereckson: Use extension registration for Graph [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) [23:37:53] (03CR) 10Dereckson: "Rebased against 267060." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [23:38:05] (03PS2) 10MtDu: Change some Nepali Wikibooks configurations * sitename * logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) [23:38:41] 10Ops-Access-Requests, 6operations: Addition to parsoid-roots - https://phabricator.wikimedia.org/T125166#1980242 (10ssastry) 3NEW [23:40:31] (03PS3) 10Dereckson: Change Nepali Wikibooks sitename and logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [23:40:32] 10Ops-Access-Requests, 6operations: Addition to parsoid-roots - https://phabricator.wikimedia.org/T125166#1980264 (10ssastry) [23:40:33] (03CR) 10BryanDavis: [C: 04-1] "Should be setting $wgMFSchemaMobileWebLanguageSwitcherSampleRate not $wmgMFSchemaMobileWebLanguageSwitcherSampleRate."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [23:41:17] (03CR) 10Dereckson: [C: 031] Change Nepali Wikibooks sitename and logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [23:42:47] 6operations: decom caesium - https://phabricator.wikimedia.org/T125165#1980274 (10Dzahn) [23:43:02] (03PS2) 10BryanDavis: Add sampling rates for mobile web language switcher in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [23:43:33] (03PS1) 10Dzahn: decom: remove caesium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267188 (https://phabricator.wikimedia.org/T125165) [23:43:56] (03CR) 10jenkins-bot: [V: 04-1] decom: remove caesium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267188 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [23:44:00] (03CR) 10MtDu: "Thanks for updating the commit message and for the review!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [23:45:05] (03CR) 10Jdlrobson: [C: 031] Add sampling rates for mobile web language switcher in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [23:45:14] (03PS2) 10Dzahn: decom: remove caesium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267188 (https://phabricator.wikimedia.org/T125165) [23:45:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:46:01] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1980289 (10Dzahn) [23:46:06] !log on ruthenium installing build dependencies and compiling uprightdiff for test [23:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:42] (03CR) 10Dzahn: [C: 032] decom: remove caesium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/267188 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [23:51:30] !log caesium - stop puppet, shutdown server, remove from icinga, clean puppet cert ... [23:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:52:48] ottomata: ping [23:53:09] joal: ping? [23:57:09] 6operations, 5Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1980335 (10Dzahn) Server has been shut down, removed from icinga, puppet cert has been cleaned, salt key revoked, but there are more remnants to remove in the repo [23:57:26] 6operations, 5Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1980336 (10Dzahn) a:5Dzahn>3Papaul [23:59:16] (03PS1) 10Alex Monk: Revert "wgRCWatchCategoryMembership true on wikipedias & commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 [23:59:30] (03PS2) 10Alex Monk: Revert "wgRCWatchCategoryMembership true on wikipedias & commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 [23:59:35] (03CR) 10Alex Monk: [C: 032] Revert "wgRCWatchCategoryMembership true on wikipedias & commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 (owner: 10Alex Monk)
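(On Bryan's -1 to the sampling-rate patch above: in wmf-config, InitialiseSettings.php keys prefixed with "wg" are exported directly as real MediaWiki globals, while "wmg"-prefixed keys are Wikimedia-internal and do nothing unless CommonSettings.php explicitly copies them into a $wg global, so a wmg-only setting is a silent no-op for the extension. A minimal sketch of the convention, using the variable from the patch; the values are illustrative:)

    // wmf-config/InitialiseSettings.php
    // A 'wg'-prefixed key becomes the global MobileFrontend reads:
    'wgMFSchemaMobileWebLanguageSwitcherSampleRate' => [
            'default' => 0,   // illustrative values only
    ],

    // A 'wmg'-prefixed key, by contrast, would also need this plumbing
    // in wmf-config/CommonSettings.php to have any effect:
    // $wgMFSchemaMobileWebLanguageSwitcherSampleRate =
    //     $wmgMFSchemaMobileWebLanguageSwitcherSampleRate;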
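(The caesium decommission steps logged above map onto a handful of standard commands. A rough sketch, assuming the stock puppet 3 and salt CLIs; the FQDN is illustrative, and the icinga removal falls out of puppet once the node's exported resources are gone:)

    # on the host being decommissioned
    puppet agent --disable "decom, see T125165"
    shutdown -h now

    # on the puppetmaster: revoke and remove the host's certificate
    puppet cert clean caesium.wikimedia.org

    # on the salt master: delete the host's key
    salt-key -d caesium.wikimedia.org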