[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T0000). Please do the needful.
[00:00:04] matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:24] !log aaron@tin Synchronized wmf-config/logging.php: No-op sync of 7e103f21a3555fc0b8f7fdea4fd8df4cb7cb939e (duration: 00m 42s)
[00:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:33] !log gerrit slowdown reported around 23:55 UTC, was back to normal after 2 minutes (T148478) - attaching latest jvm_gc log
[00:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:07] It looks like a lot of people saw that error earlier AaronSchulz / RoanKattouw
[00:02:13] (see also ops list)
[00:02:16] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2918410 (Dzahn) {F5231081}
[00:02:34] Okay, starting SWAT now.
[00:02:48] e.g. https://twitter.com/erikaherzog/status/816784936460054528
[00:02:57] PROBLEM - Check systemd state on restbase-dev1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:04:37] (Abandoned) Aaron Schulz: Fix "shard" logging processor [mediawiki-config] - https://gerrit.wikimedia.org/r/330602 (owner: Aaron Schulz)
[00:04:45] AaronSchulz: given the impact, an incident report should be filed. thanks
[00:05:50] (PS1) Aaron Schulz: Add DB "shard" column to logstash log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/330612
[00:07:51] I can't test https://gerrit.wikimedia.org/r/#/c/330611/ in production without running the script for real, which I'll do today or tomorrow, but not during SWAT.
[00:08:15] So I'll merge all the Flow-repo changes together, then test the other one.
[00:11:05] (CR) Filippo Giunchedi: [C: 2] Allocate instances for restbase-dev1* [dns] - https://gerrit.wikimedia.org/r/330609 (https://phabricator.wikimedia.org/T153880) (owner: Filippo Giunchedi)
[00:15:34] (PS6) Filippo Giunchedi: [WIP]: Enable Cassandra on restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans)
[00:15:40] AaronSchulz: can you ack the request for an incident report here, please? :)
[00:16:32] I was busy writing it
[00:17:10] !log krypton - stop exim, umount orphaned "scan" tmpfs (there is no clamav here)
[00:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:56] AaronSchulz: best answer I could have expected :)
[00:20:29] (PS1) Andrew Bogott: Nova: Replace admin token with novaadmin name/password [puppet] - https://gerrit.wikimedia.org/r/330613 (https://phabricator.wikimedia.org/T150776)
[00:20:31] (PS1) Andrew Bogott: Nova: Use keystone v3 api [puppet] - https://gerrit.wikimedia.org/r/330614
[00:20:33] (PS1) Andrew Bogott: Nova: Update filter factory to newer middleware module [puppet] - https://gerrit.wikimedia.org/r/330615 (https://phabricator.wikimedia.org/T150776)
[00:22:14] (PS2) Andrew Bogott: Nova: Update filter factory to newer middleware module [puppet] - https://gerrit.wikimedia.org/r/330615 (https://phabricator.wikimedia.org/T150776)
[00:22:16] (PS2) Andrew Bogott: Nova: Use keystone v3 api [puppet] - https://gerrit.wikimedia.org/r/330614
[00:22:18] (PS7) Filippo Giunchedi: [WIP]: Enable Cassandra on restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans)
[00:23:47] RECOVERY - Disk space on krypton is OK: DISK OK
[00:23:48] (CR) Filippo Giunchedi: [C: 1] "I've updated the instances addresses and cluster name in regex.yaml, the rest LGTM" [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans)
[00:23:55] !log krypton - chmod 751 /var/spool/exim4/ to fix Icinga alerts about inaccessible tmpfs (nagios user could not access), it was 751 on other hosts like ununpentium
[00:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:06] Okay, Jenkins finally merged the Flow patches (after a bogus error), so deploying them now.
[00:26:57] bd808: I'm ok to merge https://gerrit.wikimedia.org/r/#/c/303923/ btw if you want to give it a try
[00:27:02] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 / 04/01/2017 / 05/01/2017 - https://phabricator.wikimedia.org/T148478#2918427 (Paladox)
[00:27:30] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 / 04/01/2017 - https://phabricator.wikimedia.org/T148478#2726403 (Paladox)
[00:27:42] godog: I guess the worst that will happen is that l10nupdate breaks until we revert
[00:27:50] so +1 from me
[00:27:56] RECOVERY - Check systemd state on restbase-dev1003 is OK: OK - running: The system is fully operational
[00:28:08] (PS7) Filippo Giunchedi: l10nupdate: acquire scap lock before changing files [puppet] - https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: BryanDavis)
[00:28:10] when does the cron for that kick off? /me looks
[00:28:34] is it like 1am UTC or sth like that?
[00:29:16] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 / 04/01/2017 - https://phabricator.wikimedia.org/T148478#2918429 (Paladox) should the priority be set to high so we can figure out what is really actuall...
[00:29:36] 02:00Z apparently
[00:29:45] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 / 04/01/2017 - https://phabricator.wikimedia.org/T148478#2726455 (jcrespo) BTW, there was a short gerrit outage (icinga timeouts, at least), around 1am o...
[00:30:06] PROBLEM - Check systemd state on restbase-dev1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:30:09] (CR) Filippo Giunchedi: [C: 2] l10nupdate: acquire scap lock before changing files [puppet] - https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: BryanDavis)
[00:30:34] ok! {{done}}
[00:30:55] godog: :) now we wait!
[00:30:59] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 2016 / 2017 - https://phabricator.wikimedia.org/T148478#2918445 (Paladox)
[00:31:20] hopefully not too long on the lock {{File:Sting.ogg}}
[00:31:29] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 2016 / 2017 - https://phabricator.wikimedia.org/T148478#2918447 (demon) >>! In T148478#2918442, @jcrespo wrote: > BTW, there was a short gerrit outage (icinga timeouts, at least), around 1am on 2017-01-01, maybe relat...
[00:31:32] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 2016 / 2017 - https://phabricator.wikimedia.org/T148478#2918448 (Paladox) >>! In T148478#2918442, @jcrespo wrote: > BTW, there was a short gerrit outage (icinga timeouts, at least), around 1am on 2017-01-01, maybe rel...
[00:32:02] on a normal day there should never be contention
[00:32:42] there were a couple of times in the long ago that SWAT bled over which is what made me think of the need to lock there
[00:33:44] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 2016 / 2017 - https://phabricator.wikimedia.org/T148478#2918449 (jcrespo) Check my comment update if that can give a clue^
[00:34:51] Operations, Gerrit, Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 2016 / 2017 - https://phabricator.wikimedia.org/T148478#2918450 (Dzahn) We were planning to make Gerrit log to logstash, but it's not doing it just yet, a patch for that was waiting to be done during tomorrow's mainte...
[00:35:11] (PS2) Filippo Giunchedi: hieradata: set realserver_ips for role prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/327554 (https://phabricator.wikimedia.org/T148408)
[00:37:57] !log servermon - weird behaviour in the "pending package upgrades" list? exim4 package was shown as pending on lots of hosts, after next upgrade it disappears from list, even though about half the servers should still be listed
[00:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:06] RECOVERY - Check systemd state on restbase-dev1002 is OK: OK - running: The system is fully operational
[00:41:01] !log planet1001/2001, phab2001 - upgrade exim4, exim4-daemon-heavy
[00:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:16] (CR) Filippo Giunchedi: [C: 2] hieradata: set realserver_ips for role prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/327554 (https://phabricator.wikimedia.org/T148408) (owner: Filippo Giunchedi)
[00:42:31] !log phab2001 - same chmod 751 on exim4 dirs that i manually did on krypton is done by puppet here, fully automatic, not sure why krypton was a one-off
[00:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:05] (PS2) Dzahn: openstack: switch tftp server from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328597 (https://phabricator.wikimedia.org/T123733)
[00:48:28] (CR) Dzahn: [C: 2] "this does not affect VMs in labs, if at all it would affect real hardware in labs, the "ironic" thing aka, metal-labs" [puppet] - https://gerrit.wikimedia.org/r/328597 (https://phabricator.wikimedia.org/T123733) (owner: Dzahn)
[00:52:49] !log mattflaschen@tin Synchronized php-1.29.0-wmf.6/extensions/Flow: Two Flow fixes related to production database/content inconsistencies. (duration: 00m 59s)
[00:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:55] bd808, any idea about "No module named sh" on scap sync-dir?
[00:53:33] Heh, ostriches says it's harmless.
[00:54:21] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2080.codfw.wmnet
[00:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:45] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2080.codfw.wmnet
[00:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:07] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2081.codfw.wmnet
[00:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:20] Yvette: I did!
[00:55:23] huh, api appserver has an extra confirm step in confctl that these don't...
[00:55:38] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2082.codfw.wmnet
[00:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:41] matt_flaschen: Known issue with plugins at the moment. Ignore it. Improved error messaging so it doesn't look so scary is in the pipeline for next release
[00:55:44] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2083.codfw.wmnet
[00:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:51] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2084.codfw.wmnet
[00:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:23] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2085.codfw.wmnet
[00:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:41] Thanks, ostriches
[00:56:51] ahh, they don't have a pybal for the service of jobrunning handling in confctl hence no confirm like image and api.
[00:57:01] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2086.codfw.wmnet
[00:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:12] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2087.codfw.wmnet
[00:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:23] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2088.codfw.wmnet
[00:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:34] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2089.codfw.wmnet
[00:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:43] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2090.codfw.wmnet
[00:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:02] (PS1) Chad: Move branch.py to branch.notpy until error messaging improves [mediawiki-config] - https://gerrit.wikimedia.org/r/330618
[01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T0100).
[01:00:28] (CR) Dzahn: [C: -1] "typo, "wwf" != "wmf", sneaked in from existing wwf6479" (1 comment) [dns] - https://gerrit.wikimedia.org/r/330369 (owner: Papaul)
[01:01:32] (PS1) Dzahn: fix typo, wwf6479.mgmt -> wmf6479.mgmt [dns] - https://gerrit.wikimedia.org/r/330619
[01:01:58] Operations, ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918496 (RobH)
[01:02:00] (PS2) Dzahn: fix typo, wwf6479.mgmt -> wmf6479.mgmt [dns] - https://gerrit.wikimedia.org/r/330619
[01:02:33] (CR) Dzahn: [C: 2] fix typo, wwf6479.mgmt -> wmf6479.mgmt [dns] - https://gerrit.wikimedia.org/r/330619 (owner: Dzahn)
[01:02:35] (CR) Chad: [C: 2] Move branch.py to branch.notpy until error messaging improves [mediawiki-config] - https://gerrit.wikimedia.org/r/330618 (owner: Chad)
[01:03:09] (Merged) jenkins-bot: Move branch.py to branch.notpy until error messaging improves [mediawiki-config] - https://gerrit.wikimedia.org/r/330618 (owner: Chad)
[01:03:19] (CR) jenkins-bot: Move branch.py to branch.notpy until error messaging improves [mediawiki-config] - https://gerrit.wikimedia.org/r/330618 (owner: Chad)
[01:04:09] !log mattflaschen@tin Synchronized php-1.29.0-wmf.7/extensions/Flow: Flow script to add more troubleshooting information to a maintenance script (duration: 00m 56s)
[01:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:49] Flow changes complete. Just need the config.
[01:05:05] (PS2) Dzahn: DNS: Add mgmt and production DNS entries for elastic2025-elastic2036 Bug:T154251 [dns] - https://gerrit.wikimedia.org/r/330369 (owner: Papaul)
[01:05:19] (CR) Mattflaschen: [C: 2] Disable NewUserMessage gomwiki to prevent corruptions [mediawiki-config] - https://gerrit.wikimedia.org/r/330601 (https://phabricator.wikimedia.org/T131957) (owner: Mattflaschen)
[01:05:45] (CR) Dzahn: "fixed original typo in https://gerrit.wikimedia.org/r/#/c/330619/ and amended here" [dns] - https://gerrit.wikimedia.org/r/330369 (owner: Papaul)
[01:05:52] (Merged) jenkins-bot: Disable NewUserMessage gomwiki to prevent corruptions [mediawiki-config] - https://gerrit.wikimedia.org/r/330601 (https://phabricator.wikimedia.org/T131957) (owner: Mattflaschen)
[01:06:02] (CR) jenkins-bot: Disable NewUserMessage gomwiki to prevent corruptions [mediawiki-config] - https://gerrit.wikimedia.org/r/330601 (https://phabricator.wikimedia.org/T131957) (owner: Mattflaschen)
[01:06:35] (CR) Dzahn: [C: 2] DNS: Add mgmt and production DNS entries for elastic2025-elastic2036 Bug:T154251 [dns] - https://gerrit.wikimedia.org/r/330369 (owner: Papaul)
[01:06:55] !log demon@tin Synchronized scap/plugins: (no message) (duration: 00m 40s)
[01:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:17] (CR) Dzahn: "nitpick, could you please remove the literal tab character and indent like the other lines with spaces" [puppet] - https://gerrit.wikimedia.org/r/330178 (https://phabricator.wikimedia.org/T113696) (owner: DatGuy)
[01:10:29] Operations, ops-codfw: rack/setup/install mw2251-mw2260 - https://phabricator.wikimedia.org/T152698#2918515 (RobH)
[01:10:56] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2090.codfw.wmnet
[01:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:06] Operations, ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918519 (RobH)
[01:12:37] (CR) Filippo Giunchedi: [C: -1] "> I've just looked at one of the first performance alerts Peter and" [puppet] - https://gerrit.wikimedia.org/r/328673 (https://phabricator.wikimedia.org/T153167) (owner: Gilles)
[01:13:32] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings.php: Disable NewUserMessage on gomwiki (duration: 00m 41s)
[01:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:50] SWAT complete (and the gomwiki one, which is ready to test, works)
[01:17:55] Sorry for running over.
[01:18:53] (PS1) RobH: decommission mw2075-2089 [puppet] - https://gerrit.wikimedia.org/r/330621
[01:20:31] Operations, ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918524 (RobH)
[01:22:46] Operations, ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918525 (RobH) All systems have been depooled from pybal and should stop getting loads. I'll disable and shutdown the systems tomorrow, and then continue with the decommissionin...
[01:32:14] !log servermon - after the next update by cron - package data is back
[01:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:59] !log rolling out exim4 upgrades (DSA 3747-1) on parsoid, maps, cp-esams
[01:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:37] (CR) Andrew Bogott: [C: 2] Nova: Add identity_uri config setting [puppet] - https://gerrit.wikimedia.org/r/330607 (https://phabricator.wikimedia.org/T150776) (owner: Andrew Bogott)
[01:40:43] (PS2) Andrew Bogott: Nova: Add identity_uri config setting [puppet] - https://gerrit.wikimedia.org/r/330607 (https://phabricator.wikimedia.org/T150776)
[01:41:34] !log rolling out exim4 upgrades (DSA 3747-1) on ganeti, cp-ulsfo, wdqs, thumbor, db-es-codfw
[01:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:06] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1802.330878 Seconds
[01:48:43] !log rolling out exim4 upgrades (DSA 3747-1) on redis-codfw (rdb2005 needed manual) and all of dc-ulsfo, dc-esams
[01:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:06] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 11.205704 Seconds
[01:50:41] (PS1) Andrew Bogott: Nova: Add the admin_uri nova config setting [puppet] - https://gerrit.wikimedia.org/r/330622 (https://phabricator.wikimedia.org/T150776)
[01:51:54] (CR) Andrew Bogott: [C: 2] Nova: Add the admin_uri nova config setting [puppet] - https://gerrit.wikimedia.org/r/330622 (https://phabricator.wikimedia.org/T150776) (owner: Andrew Bogott)
[01:55:40] (PS2) Andrew Bogott: Nova: Replace admin token with novaadmin name/password [puppet] - https://gerrit.wikimedia.org/r/330613 (https://phabricator.wikimedia.org/T150776)
[01:55:42] (PS3) Andrew Bogott: Nova: Update filter factory to newer middleware module [puppet] - https://gerrit.wikimedia.org/r/330615 (https://phabricator.wikimedia.org/T150776)
[01:55:44] (PS3) Andrew Bogott: Nova: Use keystone v3 api [puppet] - https://gerrit.wikimedia.org/r/330614
[02:09:28] (PS4) Gergő Tisza: Set $wgSoftBlockRanges [mediawiki-config] - https://gerrit.wikimedia.org/r/324215 (owner: Anomie)
[02:23:49] (CR) Andrew Bogott: [C: 2] Nova: Replace admin token with novaadmin name/password [puppet] - https://gerrit.wikimedia.org/r/330613 (https://phabricator.wikimedia.org/T150776) (owner: Andrew Bogott)
[02:27:46] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:30:00] (CR) MZMcBride: "> If someone's bot gets logged out, we want it to be prevented from editing." [mediawiki-config] - https://gerrit.wikimedia.org/r/324215 (owner: Anomie)
[02:30:39] (CR) Andrew Bogott: [C: 2] Nova: Update filter factory to newer middleware module [puppet] - https://gerrit.wikimedia.org/r/330615 (https://phabricator.wikimedia.org/T150776) (owner: Andrew Bogott)
[02:33:26] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:36:10] (CR) Andrew Bogott: [C: 2] Nova: Use keystone v3 api [puppet] - https://gerrit.wikimedia.org/r/330614 (owner: Andrew Bogott)
[02:56:46] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[03:02:26] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[03:08:15] (CR) Krinkle: Include DB shard as a logstash column (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/328618 (owner: Aaron Schulz)
[03:09:28] (PS2) Krinkle: Add DB "shard" column to logstash log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz)
[03:09:30] (CR) Krinkle: "Fixed in Id8f44727." [mediawiki-config] - https://gerrit.wikimedia.org/r/328618 (owner: Aaron Schulz)
[03:09:38] (CR) Krinkle: [C: 1] Add DB "shard" column to logstash log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz)
[03:22:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 780.95 seconds
[03:24:06] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[03:29:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.91 seconds
[03:30:49] (PS4) Andrew Bogott: Labs: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/326312 (owner: Tim Landscheidt)
[03:31:19] (PS7) Krinkle: static.php should use deployed branch for invalid hashes [mediawiki-config] - https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: Brion VIBBER)
[03:40:14] (CR) Andrew Bogott: [C: 2] Labs: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/326312 (owner: Tim Landscheidt)
[03:53:06] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[03:56:09] (PS1) Andrew Bogott: Openstack: Forward some custom config changes to mitaka [puppet] - https://gerrit.wikimedia.org/r/330626
[04:05:36] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:33:36] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[04:50:45] Operations, MediaWiki-ResourceLoader, Performance-Team, Traffic: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657#2918703 (Krinkle) In an attempt to verify whether or not we can observe there being more startup reques...
[04:57:46] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:26:47] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:35:08] !log rolling out exim4 upgrades (DSA 3747-1) on cp-eqiad, memcached-eqiad
[05:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:46] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:39:41] !log rolling out exim4 upgrades (DSA 3747-1) on prometheus, aqs, db-es
[05:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:54] !log rolling out exim4 upgrades (DSA 3747-1) on db-core-codfw, etcd, graphite, kafka-analytics-canary, kafka-analytics, logstash
[05:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:55] (CR) BryanDavis: "Looks like this needs some additional work:" [puppet] - https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: BryanDavis)
[05:49:18] !log rolling out exim4 upgrades (DSA 3747-1) on swift-fe-codfw, swift-be-codfw, ALL remaining mw
[05:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:09] Operations, MediaWiki-ResourceLoader, Performance-Team, Traffic: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657#2918747 (Catrope) Keeping the max-age at 5 mins and forcing `Age: 0` sounds good to me. To respond to @...
[06:04:26] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:06:46] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:08:22] (PS1) Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633
[06:09:15] (CR) jerkins-bot: [V: -1] icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633 (owner: Dzahn)
[06:11:05] (PS2) Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633
[06:11:56] godog: ugh. the locking broke things a bit. Opening in write mode took ownership of the file and now the deploy users can't lock it
[06:12:00] (CR) jerkins-bot: [V: -1] icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633 (owner: Dzahn)
[06:15:07] !log sudo -u l10nupdate rm /var/lock/scap on tin to clean up lock left by bad l10nupdate locking attempt
[06:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:23] (PS3) Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633
[06:16:46] PROBLEM - Varnishkafka log producer on cp4019 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf
[06:17:46] RECOVERY - Varnishkafka log producer on cp4019 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf
[06:19:50] (PS4) Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633
[06:21:27] (PS5) Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717)
[06:21:29] (PS1) BryanDavis: l10nupdate: clean up scap lock file after release [puppet] - https://gerrit.wikimedia.org/r/330636 (https://phabricator.wikimedia.org/T72752)
[06:24:24] (CR) Dzahn: "expiry date: 2017-02-06" [puppet] - https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) (owner: Dzahn)
[06:29:00] (PS6) Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717)
[06:32:26] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:34:26] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:42:06] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:44:46] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:50:14] (PS1) Papaul: DHCP: Add dhcp entries for elastic2025-elastic2036 Bug:T154251 [puppet] - https://gerrit.wikimedia.org/r/330637
[06:51:56] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:55:56] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:58:46] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:01:53] (PS1) Papaul: Adding install params for elastic2025-elastic2036 Bug:T154251 [puppet] - https://gerrit.wikimedia.org/r/330638
[07:02:26] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[07:07:50] !log Compressing pagelinks tables across all the wikis - db1044 - T150438
[07:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:54] T150438: Meta ticket: Deploy InnoDB compression where possible - https://phabricator.wikimedia.org/T150438
[07:08:54] !log Compressing revision tables across all the wikis - db1015 - T153739
[07:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:57] T153739: Defragment db1015 - https://phabricator.wikimedia.org/T153739
[07:09:11] !log Compressing pagelinks tables across all the wikis - db1044 - T153826
[07:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:16] T153826: Defragment db1044 - https://phabricator.wikimedia.org/T153826
[07:11:46] !log Compressing revision tables across all the wikis - db1038 - T154465
[07:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:50] T154465: Defragment db1038 - https://phabricator.wikimedia.org/T154465
[07:26:46] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[07:53:07] Operations, User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#2918822 (elukey) From `/etc/rsyslog.conf`: ``` # # Set the default permissions for all log files. # $FileOwner root $FileGroup adm $FileCreateMode 0640 $DirCreateMode 0755 $Uma...
[07:54:14] !log chown www-data:www-data all the root:adm hhvm log files on mw eqiad hosts (T132324)
[07:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:18] T132324: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324
[07:57:28] (PS3) DatGuy: Redirect https://toolserver.org/~magnus/ [puppet] - https://gerrit.wikimedia.org/r/330178 (https://phabricator.wikimedia.org/T113696)
[08:04:11] Operations, User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#2918842 (elukey) Occurred also on eqiad hosts. Examples: * jobrunner ``` mw1166.eqiad.wmnet: -rw-r----- 1 root adm 4.1M Oct 16 06:15 /var/log/hhvm/error.log-20161015 ```...
[08:14:05] (PS1) Gilles: Pass the filtered request headers to Thumbor [puppet] - https://gerrit.wikimedia.org/r/330646 (https://phabricator.wikimedia.org/T151066)
[08:20:27] (PS1) Gilles: Add PoolCounter configuration to Thumbor [puppet] - https://gerrit.wikimedia.org/r/330647 (https://phabricator.wikimedia.org/T151066)
[08:23:46] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:42:06] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2918906 (Joe) >>! In T149617#2918215, @cscott wrote: > Note that Parsoid also loads a bunch o...
[08:42:19] (03PS2) 10Muehlenhoff: Don't apply NTP Icinga check to standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330411 [08:46:06] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [08:53:06] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:01:46] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 2 minutes ago with 20 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [09:26:46] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [09:30:06] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:32:02] (03PS4) 10Pnorman: osm: install prerequisite packages for meddo [puppet] - 10https://gerrit.wikimedia.org/r/328176 (https://phabricator.wikimedia.org/T153289) (owner: 10Gehel) [09:34:11] (03CR) 10Muehlenhoff: [C: 032] Don't apply NTP Icinga check to standard::ntp::timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330411 (owner: 10Muehlenhoff) [09:39:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Various issues, please see the comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [09:46:06] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [09:56:07] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [10:02:25] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2918990 (10MoritzMuehlenhoff) [10:02:28] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - 
https://phabricator.wikimedia.org/T147926#2918988 (10MoritzMuehlenhoff) 05Resolved>03Open aqs100[1-3] are still listed in site.pp [10:10:58] (03PS1) 10Muehlenhoff: Move another host to timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330649 [10:13:51] (03CR) 10Muehlenhoff: [C: 032] Move another host to timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330649 (owner: 10Muehlenhoff) [10:25:07] RECOVERY - Check systemd state on multatuli is OK: OK - running: The system is fully operational [10:32:36] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [10:38:35] (03PS2) 10Muehlenhoff: Switch swift in esams to systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) [10:43:07] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/5032/" [puppet] - 10https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:03:56] !installing pcsc-lite security updates [11:05:36] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 2 minutes ago with 10 failures. Failed resources (up to 3 shown): Service[salt-minion],Service[ssh],Service[nagios-nrpe-server],Package[tzdata] [11:10:26] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2917240 (10akosiaris) This time around sca does not exhibit any errors. I am the only one currently using this box, so it's a good time to reboot. I 'll try to do that same dance as for T152339 and see what happens. 
[11:10:34] (03PS3) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:10:59] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2919175 (10Joe) 05Open>03Resolved a:03Joe [11:11:14] !log uploaded firejail 0.9.44+wmf2 for jessie-wikimedia to carbon [11:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:27] (03CR) 10Volans: "Replies inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [11:13:21] !log rebooting bast3001, T154603 [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:24] T154603: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603 [11:15:06] PROBLEM - Host bast3001 is DOWN: PING CRITICAL - Packet loss = 100% [11:15:08] aah, that explains where my ssh session went ;) [11:15:16] RECOVERY - Host bast3001 is UP: PING OK - Packet loss = 0%, RTA = 83.79 ms [11:15:32] addshore: hm I did check I was the only one using it [11:15:48] I think I logged in literally seconds before you rebooted it :D [11:15:55] ah ok [11:16:06] RECOVERY - MD RAID on bast3001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [11:16:13] akosiaris: I was using it too [11:16:38] (03CR) 10Zfilipin: build: update rubocop to 0.39 and tweak config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [11:16:49] no big deal though :) [11:16:50] <_joe_> bad akosiaris [11:16:54] ema: ah, damn, not logged in, just proxying through it [11:16:56] sigh [11:16:57] sorry [11:17:07] I 'll do my penance [11:17:28] I 'll write 100 times on a blackboard "I will not reboot boxes without a proper warning" [11:18:29] ok, this time around the box is actually suffering from a failed disk [11:18:45] ema: 
addshore: it's up btw [11:19:02] akosiaris: cool, thanks [11:32:36] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:37:15] (03PS1) 10Muehlenhoff: Also exclude time servers when using timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330660 [11:39:27] 06Operations: Production error message points users to donate link, that is likely to also produce the same error message - https://phabricator.wikimedia.org/T154627#2919232 (10Aklapper) Messages are in /operations/mediawiki-config/errorpages/ [11:40:55] (03CR) 10Alexandros Kosiaris: "Instead of creating a custom check, it might make sense to instead do via nrpe a" [puppet] - 10https://gerrit.wikimedia.org/r/330411 (owner: 10Muehlenhoff) [11:46:49] (03CR) 10Muehlenhoff: "Faidon pointed me to this plugin used by the Debian system administrators, I'll integrate that: https://anonscm.debian.org/cgit/mirror/dsa" [puppet] - 10https://gerrit.wikimedia.org/r/330411 (owner: 10Muehlenhoff) [11:48:14] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2919251 (10akosiaris) So, sda did not indeed exhibit any errors, sdb was kicked out 2 of the 3 arrays (it was kept in the swap array as there was no read/write activity there) and I was unable to get any info from... 
[11:51:31] 06Operations, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#2919253 (10Joe) [12:04:54] (03CR) 10Hashar: "0.40 requires a lot more change, so I am going to bump the version in small increments to make it easier to review the changes :)" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [12:09:42] (03PS4) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [12:10:21] (03CR) 10Volans: "Done also the proposal for the unified role" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:18:52] (03PS11) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [12:19:17] (03CR) 10Mobrovac: RESTBase-Cassandra: Add the topk reporter (0317 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [12:24:26] !log installing firejail security updates on scb [12:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:49] !log mobrovac@tin Starting deploy [citoid/deploy@da96f4b]: (no message) [12:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:03] just a restart ^ [12:36:16] * mobrovac failed to provide a msg to scap3 [12:36:55] !log mobrovac@tin Finished deploy [citoid/deploy@da96f4b]: (no message) (duration: 01m 06s) [12:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:46] !log mobrovac@tin Starting deploy [cxserver/deploy@0279029]: Restart for firejail upgrade [12:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:26] !log mobrovac@tin Finished deploy [cxserver/deploy@0279029]: 
Restart for firejail upgrade (duration: 00m 39s) [12:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:22] !log mobrovac@tin Starting deploy [graphoid/deploy@151f26c]: Restart for firejail upgrade [12:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:04] !log mobrovac@tin Finished deploy [graphoid/deploy@151f26c]: Restart for firejail upgrade (duration: 00m 43s) [12:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:26] !log mobrovac@tin Starting deploy [mathoid/deploy@79fdd56]: Restart for firejail upgrade [12:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:07] !log mobrovac@tin Finished deploy [mathoid/deploy@79fdd56]: Restart for firejail upgrade (duration: 00m 41s) [12:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:31] !log mobrovac@tin Starting deploy [mobileapps/deploy@c39bd1f]: Restart for firejail upgrade [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1770 bytes in 0.086 second response time [12:43:53] (03CR) 10Alexandros Kosiaris: "OK, looks simple enough, it should do the trick. 
And actually reports on timesyncd's status and not indirectly via the NTP offset it is pr" [puppet] - 10https://gerrit.wikimedia.org/r/330411 (owner: 10Muehlenhoff) [12:44:28] (03CR) 10Muehlenhoff: Cumin: allow connection to the targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:45:37] !log mobrovac@tin Finished deploy [mobileapps/deploy@c39bd1f]: Restart for firejail upgrade (duration: 04m 05s) [12:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:46] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1736 bytes in 0.106 second response time [12:54:38] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2919345 (10elukey) >>! In T153951#2917859, @Nuria wrote: > Also here: https://issues.apache.org/jira/browse/HADOOP-11105 regarding class: org.apache.hadoop.metrics2.impl.MetricsSyst... 
[12:56:49] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2919356 (10elukey) @Cmjohnson - ping :) [12:57:36] !log mobrovac@tin Starting deploy [electron-render/deploy@b2a820e]: Restart for firejail upgrade [12:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:36] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [13:01:16] !log mobrovac@tin Finished deploy [electron-render/deploy@b2a820e]: Restart for firejail upgrade (duration: 03m 40s) [13:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:01] !log mobrovac@tin Starting deploy [electron-render/deploy@b2a820e]: Restart for firejail upgrade [13:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:36] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.007 second response time [13:05:27] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2919387 (10Liuxinyu970226) [13:07:30] !log mobrovac@tin Finished deploy [electron-render/deploy@b2a820e]: Restart for firejail upgrade (duration: 05m 29s) [13:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:06] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2919396 (10Liuxinyu970226) [13:08:18] !log mobrovac@tin Starting deploy [trending-edits/deploy@c5d239b]: Restart for firejail upgrade [13:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:39] (03CR) 10Zfilipin: build: update rubocop to 0.39 and tweak config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:09:04] !log mobrovac@tin Finished deploy [trending-edits/deploy@c5d239b]: 
Restart for firejail upgrade (duration: 00m 46s) [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:12] (03PS3) 10Hashar: build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [13:10:32] (03CR) 10Hashar: build: update rubocop to 0.39 and tweak config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:17:38] (03PS1) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) [13:25:16] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:25:42] (03CR) 10Ema: [C: 031] "LGTM. The following graph might also be useful when investigating 5xx spikes: https://grafana.wikimedia.org/dashboard/file/varnish-aggrega" [puppet] - 10https://gerrit.wikimedia.org/r/330460 (owner: 10Filippo Giunchedi) [13:46:44] !log installing audiofile security updates [13:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:00] jouncebot: next [13:51:00] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T1400) [13:51:20] hashar: one patch in eu swat ^ [13:54:16] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:58:45] zeljkof: +2 ed for the swat [13:59:00] hashar: want to do the swat, or should I? [14:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T1400). Please do the needful. [14:00:05] kart_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. 
Please be available during the process. [14:00:12] o/ [14:01:26] zeljkof: here [14:02:07] kart_, hashar: something came up, one kid sick, have to pick up the other one [14:02:15] hashar: can you do the swat, please? [14:02:33] zeljkof: Oops. Take care. [14:03:28] zeljkof: yeah [14:03:28] * hashar while listening to https://www.youtube.com/watch?v=P6T_khwq-aY [14:03:51] hashar: thanks, back in an hour or so [14:04:26] kart_: waiting for tests to complete [14:04:34] then I will push the change to mwdebug1001 [14:04:38] hashar: sure. I can listen that song meanwhile. [14:04:58] that is a french artist [14:05:12] hashar: quite difficult to test as we thought ca/he wikis are on wmf7, but seems they are still on wmf6 [14:05:39] guess we had to hold the train [14:05:54] hashar: Okay. Any chances to run train tonight? [14:06:26] no clue [14:06:34] I haven't followed the train activity. Best take is to look at the task [14:07:25] I believe the train will probably be running tonight [14:08:22] addshore: and I will want the wikidata version switch to be moved to a different time slot [14:08:35] hmm, wikidata version switch? [14:08:37] as during our afternoon instead of late in our evening [14:09:08] you mean wikidata wmf5 -> wmf5.1 ? [14:09:17] kart_: I guess you have tested it on beta cluster / master branch haven't you ? [14:09:24] or something else? [14:09:48] addshore: I meant either bumping wikidata extension or switch wikidata.org to a new wmf version [14:10:12] !log upgrading firejail on restbase staging hosts [14:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] * addshore has nothing to do with that & can't see anything in the calendar! [14:10:22] <_joe_> andre__: around? 
[14:10:26] or to say otherwise, given wikidata is maintained by people that are mostly in Germany, it would be nicer to have the deployment done during business hours instead of in the evening [14:10:45] hashar: master is Okay [14:11:07] kart_: great [14:11:15] kart_: so I am going to push it [14:11:22] hashar: cool. go ahead. [14:11:32] and you will be able to verify it in prod later whenever wmf.7 is deployed [14:12:10] yep [14:13:39] bah [14:13:44] pending sync-masters [14:15:09] !log hashar@tin Synchronized php-1.29.0-wmf.7/extensions/ContentTranslation: Workaround to fix restoration for truncated section ids - T154279 (duration: 02m 10s) [14:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:13] T154279: Cannot load a saved translation, JS error from Sizzle at $section.data( 'source' ) - https://phabricator.wikimedia.org/T154279 [14:15:28] kart_: done :}} [14:15:38] hashar: thanks! [14:17:24] (03PS1) 10Alexandros Kosiaris: icinga: Move SSL settings from role to module [puppet] - 10https://gerrit.wikimedia.org/r/330673 [14:19:30] !log upgrading firejail on restbase production hosts [14:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:29] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: fix typo in build script [puppet] - 10https://gerrit.wikimedia.org/r/330221 [14:25:56] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] docker::baseimages: fix typo in build script [puppet] - 10https://gerrit.wikimedia.org/r/330221 (owner: 10Giuseppe Lavagetto) [14:29:01] (03PS3) 10Muehlenhoff: Enable enhanced sandbox privilege separation for sshd [puppet] - 10https://gerrit.wikimedia.org/r/330227 [14:30:50] (03CR) 10Muehlenhoff: [C: 032] Enable enhanced sandbox privilege separation for sshd [puppet] - 10https://gerrit.wikimedia.org/r/330227 (owner: 10Muehlenhoff) [14:37:05] 06Operations, 10hardware-requests: Site: (3) hardware access request for dedicated Labs puppetmasters - 
https://phabricator.wikimedia.org/T147053#2919496 (10chasemp) [14:37:17] 06Operations, 10hardware-requests: Site: (2) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2679484 (10chasemp) [14:38:55] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2919498 (10akosiaris) The resync is done and neither sda nor sdb logged any kind of errors during the resync process which further enforces the controller issue theory. That being said, I don't see any hardware rai... [14:39:11] 06Operations, 10hardware-requests: Site: (4) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10chasemp) [14:39:23] 06Operations, 10hardware-requests: Site: (4) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919511 (10chasemp) 05Open>03stalled p:05Triage>03Normal [14:39:25] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2919513 (10akosiaris) 05Open>03Resolved a:03akosiaris I think I am gonna resolve this for now and decide how to act on this if it happens again. [14:40:01] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/5034/ is happy, merging" [puppet] - 10https://gerrit.wikimedia.org/r/330673 (owner: 10Alexandros Kosiaris) [14:40:07] (03PS2) 10Alexandros Kosiaris: icinga: Move SSL settings from role to module [puppet] - 10https://gerrit.wikimedia.org/r/330673 [14:40:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] icinga: Move SSL settings from role to module [puppet] - 10https://gerrit.wikimedia.org/r/330673 (owner: 10Alexandros Kosiaris) [14:41:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "hmm, ssl settings probably don't depend anymore in the role, but rather a "profile" class per our newly adopted RFC. 
We don't have that bu" [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [14:46:39] !log upgrading firejail on image scalers [14:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:49] !log restbase restarting for firejail upgrade [15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:31] 06Operations: Look into behaviour of /etc/exim4/update-exim4.conf.conf related to updates - https://phabricator.wikimedia.org/T154665#2919557 (10MoritzMuehlenhoff) [15:29:36] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [15:29:43] 06Operations: Look into behaviour of /etc/exim4/update-exim4.conf.conf related to updates - https://phabricator.wikimedia.org/T154665#2919569 (10MoritzMuehlenhoff) p:05Triage>03Low [15:32:51] 06Operations, 10netops: cr2-esams<->cr2-eqiad link flaps - https://phabricator.wikimedia.org/T154577#2919570 (10faidon) p:05High>03Low Level3 responded yesterday that they performed a "warm reset" on the LTX card, which we indeed saw as a longer link down at Jan 4 18:37:32. The link seems stable since and... 
[15:33:46] PROBLEM - Swift HTTP frontend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused [15:34:16] PROBLEM - Swift HTTP backend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused [15:35:05] ^ ms-fe3001 is just a test host, will ping Filippo on it [15:44:33] !log labstore1005 systemctl disable create-dbusers [15:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:48] (03CR) 10Eevans: RESTBase-Cassandra: Add the topk reporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [15:58:24] (03CR) 10Stryn: [C: 031] "looks fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326409 (https://phabricator.wikimedia.org/T151570) (owner: 10Odder) [15:59:52] !log upgrading firejail on thumbor servers [15:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:37] (03CR) 10Anomie: "> Who are you speaking for? 
I'd personally rather have the edit made," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [16:14:48] (03PS1) 10Urbanecm: Add digitalmedia.fws.gov to the whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330685 (https://phabricator.wikimedia.org/T154671) [16:21:12] !log rolling out exim4 upgrades (DSA 3747-1) on db-core-eqiad, db-misc-servers, videoscaler [16:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:33] (03CR) 10Zfilipin: build: update rubocop to 0.39 and tweak config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [16:28:34] !log rolling out exim4 upgrades (DSA 3747-1) on mw-api, swift-fe, swift-be, sca, scb, misc-analytics [16:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:52] (03PS2) 10Filippo Giunchedi: l10nupdate: clean up scap lock file after release [puppet] - 10https://gerrit.wikimedia.org/r/330636 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis) [16:30:02] bd808: doh! re: scap lock, I'll merge the fix [16:30:48] mutante: Do those exim upgrades need to go everywhere? 
Asking for gerrit/cobalt [16:31:24] (03CR) 10Filippo Giunchedi: [C: 032] l10nupdate: clean up scap lock file after release [puppet] - 10https://gerrit.wikimedia.org/r/330636 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis) [16:31:50] ostriches: yes, but it should be in one of the "misc" groups anyways [16:32:16] Okie dokie, not trying to get in your way, mostly asking cuz I noticed them available in apt there [16:32:33] what i'll do is watch the servermon page which tells me the missing ones [16:32:42] yea, we can also just do it now [16:32:50] "manually" as opposed to debdeploy [16:33:10] !log cobalt upgrading exim packages [16:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:17] done [16:33:29] Awesome [16:33:43] (03CR) 10Filippo Giunchedi: [C: 031] Switch swift in esams to systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [16:34:56] jouncebot: next [16:34:56] In 0 hour(s) and 25 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T1700) [16:41:37] godog: thanks for that merge. I think that will fix the small hiccup that stopped it last night [16:43:13] bd808: no worries, thanks for taking a look [16:52:00] !log rolling out exim4 upgrades (DSA 3747-1) on notebook, lvs-canary, lvs, mw-maintenance, all-mw-codfw [16:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:46] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[exim4-daemon-light],Package[exim4-config] [16:59:28] !log rolling out exim4 upgrades (DSA 3747-1) on all-db-noncore, all-mw-eqiad, restbase-eqiad, kafka-main [16:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T1700). Please do the needful. [17:00:05] Krenair and Hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:18] O/ [17:00:46] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:01:14] ok, I'll start with hashar [17:01:29] Krenair: I've run the puppet compiler on your changes and LGTM [17:01:56] my first patch changes the way the puppet inline doc is generated. Must be a noop for prod [17:02:12] the current doc is lame https://doc.wikimedia.org/puppet/ [17:02:20] (03PS7) 10Filippo Giunchedi: Puppet doc with strings/yard [puppet] - 10https://gerrit.wikimedia.org/r/309561 (https://phabricator.wikimedia.org/T143233) (owner: 10Hashar) [17:02:32] will get us something more modern such as https://doc.wikimedia.org/rubygems/mediawiki-selenium/ :) [17:02:39] yeah easy enough [17:02:47] ok [17:02:51] hashar: the second one https://gerrit.wikimedia.org/r/#/c/330470/ doesn't seem to have consensus [17:04:47] (03CR) 10Filippo Giunchedi: [C: 032] Puppet doc with strings/yard [puppet] - 10https://gerrit.wikimedia.org/r/309561 (https://phabricator.wikimedia.org/T143233) (owner: 10Hashar) [17:05:09] godog: eek [17:05:16] let's skip it [17:05:23] will revisit it with zeljko [17:05:39] ok! [17:05:53] (03CR) 10Hashar: [C: 04-1] "I tried auto generate the config and it eventually screw up everything. 
Guess i have to revisit this change :-)" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [17:06:44] !log rolling out exim4 upgrades (DSA 3747-1) on kraz, wdqs2003, wezen, zosma, tegmen, rutherfordium. upgrade kernel and python-requests on zosma [17:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:51] godog: thx :) [17:07:57] !log scandium (zuul merger), upgrade exim, python-requests, kernel version [17:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:02] (03PS2) 10Filippo Giunchedi: beta: Move beta-specific VHosts into their own apache config file [puppet] - 10https://gerrit.wikimedia.org/r/322601 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [17:08:27] (03CR) 10Filippo Giunchedi: "PCC reports the change as noop as it should for production, https://puppet-compiler.wmflabs.org/5038/" [puppet] - 10https://gerrit.wikimedia.org/r/322601 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [17:08:54] Krenair: looking at the beta change first [17:08:57] ok [17:09:10] I take it the change is already on beta puppetmaster ? [17:10:52] not this time, no [17:11:24] (03CR) 10Filippo Giunchedi: [C: 032] beta: Move beta-specific VHosts into their own apache config file [puppet] - 10https://gerrit.wikimedia.org/r/322601 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [17:11:57] ok [17:14:12] Krenair: merged, please take a look at beta [17:14:36] doing [17:15:02] !log rolling out exim4 upgrades (DSA 3747-1) on ruthenium, einsteinium (icinga), etherpad1001, rhodium, | einsteinium: upgrade python packages, kernel | xenon: apt-get autoremove, upgrade python- arcconf, libs... [17:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:45] godog, lgtm [17:17:00] godog: https://doc.wikimedia.org/puppet/ !!! thanks a bunch [17:17:13] Krenair: ack, noop in prod [17:17:27] hashar: neat, thanks for working on that! 
[17:17:42] yeah took me a few months :/ [17:17:49] the toolchain is a mess [17:18:20] that's been my general experience as well with ruby things [17:18:21] (03PS5) 10Alex Monk: Move some production apache config files to templates [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) [17:18:51] godog: On the apache front...is https://gerrit.wikimedia.org/r/#/c/321916/ possible to merge? [17:20:40] ostriches: possibly, I'll take a look after the last patch [17:20:57] Awesome thx [17:21:12] topic:make-docroots-sane-again [17:21:41] (03CR) 10Filippo Giunchedi: "PCC is noop https://puppet-compiler.wmflabs.org/5039/" [puppet] - 10https://gerrit.wikimedia.org/r/311648 (owner: 10Alex Monk) [17:21:50] (03PS2) 10Filippo Giunchedi: Replace repeated UseMod rewrites in apache config with existing include [puppet] - 10https://gerrit.wikimedia.org/r/311648 (owner: 10Alex Monk) [17:23:11] how is that a no-op in PCC? [17:23:57] 06Operations, 07Puppet, 10Continuous-Integration-Config, 13Patch-For-Review, 07Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2919950 (10hashar) 05Open>03Resolved I got rid of `puppet rdoc` entirely. We are now using `puppet-s... [17:24:53] Krenair: I think because resource-wise nothing changes and the files are the same now [17:25:10] 06Operations, 10hardware-requests: codfw: (4) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919952 (10RobH) [17:25:14] the files are different... they have the same meaning when apache expands the includes, but puppet doesn't know about that [17:26:20] yeah that's true [17:28:07] looks sane anyways, I'll merge [17:28:12] (03CR) 10Filippo Giunchedi: [C: 032] Replace repeated UseMod rewrites in apache config with existing include [puppet] - 10https://gerrit.wikimedia.org/r/311648 (owner: 10Alex Monk) [17:28:22] yeah. 
though I'm now worried about using PCC elsewhere [17:29:33] indeed, worth opening a task to at least inquire about that, maybe it is a bug maybe not [17:29:38] 06Operations, 10hardware-requests: codfw: (4) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH) I've chatted with Chase about this via IRC. The one host to work as a labvirt/nova/neutron host will be for small VM testing, so it doesn't need to be a virt box, just... [17:30:48] 06Operations, 10hardware-requests: eqiad: (2) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2919964 (10RobH) [17:34:02] Krenair: godog there is a puppet-compiler bug that fails to notice diff in catalogs ( https://phabricator.wikimedia.org/T149432 ) [17:34:33] scary [17:35:07] !log disabling puppet on mw2075-2089 to decommission them today. [17:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:30] "fun" [17:36:54] 06Operations, 06Labs, 13Patch-For-Review: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2919989 (10Andrew) modules/puppetmaster/files/labtest.hiera.yaml is still used, and appears to still be useful. [17:36:56] ostriches: https://gerrit.wikimedia.org/r/#/c/321916 looks good to me, I can check prod can you check beta after the merge? [17:36:56] preguiça* [17:37:08] Yeah, it's just commons for beta [17:37:09] lestaty, hm? [17:37:15] We don't have a usability wiki there :) [17:37:26] wrong channel sorry [17:37:30] (03PS2) 10RobH: decommission mw2075-2089 [puppet] - 10https://gerrit.wikimedia.org/r/330621 [17:37:32] (03PS4) 10Filippo Giunchedi: Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [17:38:02] godog, you're doing the disable puppet everywhere + slowly re-enable thing? 
[17:38:29] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2919998 (10RobH) [17:39:12] (03CR) 10RobH: [C: 032] decommission mw2075-2089 [puppet] - 10https://gerrit.wikimedia.org/r/330621 (owner: 10RobH) [17:39:23] Krenair: I might for eqiad, I just noticed it touches /wiki too [17:39:44] though the directories are the same on the filesystem [17:40:15] are we talking about the same thing? [17:40:20] I was looking at my legacy UseMod includes thing [17:41:05] ah, no that was fine, puppet DTRT [17:41:36] (03CR) 10Gergő Tisza: [C: 031] "There probably should be an associated ticket with #user-notice so Tech News can pick it up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [17:41:38] (03PS2) 10Muehlenhoff: Also exclude time servers when using timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330660 [17:44:09] Anyone have any experience with removing nodes from conf-tool, and possible failures for it to do so on puppet merge? [17:45:58] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2920026 (10Nuria) Ok, fix for https://issues.apache.org/jira/browse/YARN-5482 is also in the same method: unregisterSource, that is why memory seems to leak from the MetricsSystemI... 
[17:46:58] Krenair: so yeah your change is already deploying by itself, I'm looking at ostriches' [17:47:54] (03CR) 10Muehlenhoff: [C: 032] Also exclude time servers when using timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330660 (owner: 10Muehlenhoff) [17:50:40] (03PS4) 10Paladox: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [17:52:46] 06Operations, 10Traffic: Extra RTT on TLS handshakes - https://phabricator.wikimedia.org/T150561#2920075 (10BBlack) 05Open>03Resolved a:03BBlack [17:53:25] 06Operations, 10Traffic, 07Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2920097 (10BBlack) These are now deployed (digicert in esams, globalsign elsewhere). Pending closing this until we document switching off either of the certs... [17:53:53] (03CR) 10Paladox: "Requires dba to convert the db to utf8mb4, even though the connection won't be utf8mb4 but will be utf8 it will work as we are connecting " [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:54:01] ostriches ^^ :) [17:54:28] (03PS5) 10Paladox: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [17:55:51] (03CR) 10Filippo Giunchedi: [C: 032] Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [17:56:00] (03PS5) 10Filippo Giunchedi: Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [17:56:15] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Commons/Usability docroots: Use wikimedia.org standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/321916 (owner: 10Chad) [17:57:28] 06Operations, 
10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2920126 (10Paladox) @jcrespo this https://gerrit.wikimedia.org/r/#/c/330455/ patch should at least stop the 500 error's, the patc... [17:57:42] !log shutting down mw2075-2089 for decom per T154621 [17:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:47] T154621: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621 [17:59:20] ostriches: change is applied to mwdebug, still LGTM [18:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T1800). [18:01:06] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [18:01:31] godog: Looking at beta right now [18:02:19] hrmm [18:05:07] godog: Commons on beta lgtm as well [18:05:16] (plus strategy/commons on mwdebug in prod) [18:05:28] hey look, testing! [18:05:30] :P [18:05:32] !log arlolra@tin Starting deploy [parsoid/deploy@465f9c4]: Updating Parsoid to 974dd5b3 [18:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:37] neat, thanks ostriches [18:06:06] geez, 27 puppet failures on sca? [18:06:13] 06Operations: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#2920182 (10Kelson) @chasemp I'll clean it piece by piece to see if it works fine without. Until now I have only verified other aspects (than storage) of the full dumping. 
[18:06:16] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:07:08] Krenair: Sorta relatedly, there's an uncommitted change to deployment-puppetmaster02...assuming it's yours? If so, please commit or discard (I had to stash & reapply to pull my stuff) [18:07:39] looking at sca2003 [18:08:06] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:08:21] greg-g: eh.. yea. it is not happening again [18:08:49] weird [18:08:58] * greg-g just randomly saw and commented [18:09:18] my guess is it was using dpkg , maybe from my upgrades [18:09:31] and then puppet run and was told it's already locked [18:09:38] and on next run everything was normal again [18:10:12] 27 failures and looks they were all about Package[] [18:10:59] ostriches, yes it's mine [18:11:57] it was a partial revert of https://gerrit.wikimedia.org/r/#/c/327465/ [18:12:40] Krenair: Fair nuff, just saying it needs either a local commit or a discard, uncommitted changes are the absolute worst :) [18:12:43] yeah [18:12:45] I'm working on it [18:13:25] ok all done re: puppet swat and nothing is on fire [18:13:53] Krenair: Thx [18:14:11] greg-g: FYI https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation%2F20170104-MonologSpi&type=revision&diff=1271021&oldid=1270662 [18:14:15] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2920212 (10RobH) [18:14:28] (03Abandoned) 10Chad: WIP: Remove mobileportal docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323999 (owner: 10Chad) [18:14:48] (03Abandoned) 10Chad: Kill skins-1.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321730 (owner: 10Chad) [18:15:12] (03CR) 10Alex Monk: "There was something weird going on that prevented puppet on beta from picking this up until recently, but now puppet 
won't run on deployme" [puppet] - 10https://gerrit.wikimedia.org/r/327465 (owner: 10Giuseppe Lavagetto) [18:15:46] !log rolling out exim4 upgrades (DSA 3747-1) on puppetmaster, yubiauth, oresrdb, (oresrdb1001 - Unknown installation error.. eh.. this is new) [18:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:19] !log arlolra@tin Finished deploy [parsoid/deploy@465f9c4]: Updating Parsoid to 974dd5b3 (duration: 10m 46s) [18:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:06] (03PS1) 10Alex Monk: Partial revert of I89bd171e (LE part) [puppet] - 10https://gerrit.wikimedia.org/r/330694 [18:18:37] (03CR) 10Alex Monk: "Cherry-picked in betA" [puppet] - 10https://gerrit.wikimedia.org/r/330694 (owner: 10Alex Monk) [18:18:48] (03CR) 10Alex Monk: "I2c27e2c5" [puppet] - 10https://gerrit.wikimedia.org/r/327465 (owner: 10Giuseppe Lavagetto) [18:20:07] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2920232 (10RobH) So, the port info for mw2079 onwards is missing off the switch stacks. So I'm not sure which exact network ports to disable. @Papaul: Please list off the networ... [18:20:20] RoanKattouw: that's kind of bandwaggoning, it's not related to the outage it's just "why did stashbot not work" [18:20:30] godog: How long until apache changes are everywhere :) [18:20:40] ostriches: 20min tops [18:21:07] Mmk. I've got a followup mw-config change but need apache changes errywhur first [18:21:08] :) [18:21:52] greg-g: Eh, that incident report has turned into "everything we noticed" anyway :p [18:22:22] greg-g: Fair enough, it's not really related, but if somehow had tried to build this incident report from the log instead of their own IRC backscroll, they'd have failed [18:22:29] s/somehow/someone [18:22:46] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:22:50] ostriches, are you deploying something? [18:22:58] RoanKattouw: right right [18:23:05] yurik: Not yet, no [18:23:09] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2920246 (10zhuyifei1999) Someone just reset a ton of 720p webm White House Press Briefing videos (which by curr... [18:23:11] stashbot: you there? [18:23:11] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [18:23:12] I'm doing a mw config thing shortly though [18:23:37] ostriches, i need to refresh graphoid service [18:24:01] Eh, won't get in my way anyway :) [18:24:49] arlolra, actually you were the last one i think - still deploying? [18:25:24] !log rolling out exim4 upgrades (DSA 3747-1) on prometheus, mwlog, pollux, labmon, lithium, hassaleh, dubnium, graphite1002, tin, serpens, bromine, dataset1001 [18:25:26] !log Updated Parsoid to 974dd5b3 (T143183, T102134, T113044) [18:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] yurik: no, we're done [18:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:32] T143183: Unexpected Parsoid tokenization - https://phabricator.wikimedia.org/T143183 [18:25:32] T113044: Converting {{{foo}}} from wikitext to html to wikitext returns a 500 error - https://phabricator.wikimedia.org/T113044 [18:25:32] T102134: Content anchors (e.g. #cite_note5) points to Main_Page instead of current article - https://phabricator.wikimedia.org/T102134 [18:25:56] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tshark] [18:26:14] bd808: <3 https://wikitech.wikimedia.org/w/index.php?title=Tool%3ASAL&type=revision&diff=1271064&oldid=413011 <3 [18:26:35] the icinga-wm alert for graphite is because it was installing package upgrades during puppet run [18:26:36] !log yurik@tin Starting deploy [graphoid/deploy@d20b00e]: (no message) [18:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:56] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:28:21] greg-g: so yeah... [18:28:37] I ask so I can do it for yoU! [18:29:06] "format as the right shape of json blob and stuff into the backing elasticsearch index" [18:29:48] !log yurik@tin Finished deploy [graphoid/deploy@d20b00e]: (no message) (duration: 03m 11s) [18:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:09] with both the wiki and the es index being managed by the same bot now I guess that gets down to parsing from some irc log somewhere [18:31:27] ideally we'd make a way to update the wiki too. that logic probably all belongs in stashbot [18:31:31] bd808: blugh [18:31:37] * greg-g nods [18:31:41] the blugh is to your first response [18:32:49] the way I did it in the past was with some python script that I would paste the wiki records into and then some hand fixing because I never quite got the python right and finally streaming into es over an ssh tunnel ;) [18:33:06] "easy" [18:33:20] think trebuchet and not scap ;) [18:34:53] The right way to fix this is with some refactoring in stashbot to separate the biz logic from the irc engine and then a frontend script that you can feed a log file to [18:34:55] !log fermium (lists server) - upgrading exim packages, exim4-daemon-heavy, forcing puppet run [18:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:07] patches welcome! 
the code is in gerrit now [18:35:16] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:37:00] greg-g: related thing I've thought of. Should we rename the #adminbot phab project for stashbot or make a new separate project? [18:37:42] adminbot (e.g. morebots) is shutdown now and I doubt it will ever return [18:37:57] the code has too many bugs nobody wanted to fix [18:42:48] (03PS1) 10RobH: remove reclaimed systems from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/330696 [18:44:02] (03CR) 10RobH: [C: 032] remove reclaimed systems from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/330696 (owner: 10RobH) [18:45:47] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2920349 (10RobH) 05Open>03Resolved Indeed, a quick grep shows no other entries. Removed. [18:45:49] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2920351 (10RobH) [18:49:46] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:55:36] bd808: /me shrugs I haven't looked at that project [18:56:09] * greg-g is just realizing his plans and promises to his son may be drastically changed due to the heavy snow fall in tahoe this weekend :/ [18:56:49] snowpocalypse is upon us [18:57:22] if you just promised sledding you can stop at the first snowy hill and declare vicotry [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T1900). Please do the needful. 
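Note on the 18:34:53 suggestion (a frontend script that you can feed a log file to): a minimal sketch of the parsing half only, assuming the log has one message per line in the "[HH:MM:SS] !log message" shape seen in this transcript. This is not stashbot's actual code. The function name and dictionary fields are hypothetical, and nothing here writes to the wiki or to Elasticsearch.

```python
import re

# Assumed line shape, taken from this transcript: "[HH:MM:SS] !log <message>"
LOG_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] !log (.+)$")
# Phabricator task IDs as they appear in !log messages, e.g. T154621
TASK_RE = re.compile(r"\bT\d+\b")


def parse_log_lines(lines):
    """Return one dict per !log line; every other line is skipped."""
    entries = []
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue
        time, message = match.groups()
        entries.append({
            "time": time,
            "message": message,
            "tasks": TASK_RE.findall(message),
        })
    return entries


# Sample lines copied from this channel log.
sample = [
    "[18:25:26] !log Updated Parsoid to 974dd5b3 (T143183, T102134, T113044)",
    "[18:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log",
    "[17:57:42] !log shutting down mw2075-2089 for decom per T154621",
]
entries = parse_log_lines(sample)
for entry in entries:
    print(entry)
```

Keeping the parser free of any IRC or storage code is the point of the refactoring idea: the same function could sit behind the bot and behind a backfill script.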
[19:00:47] I'm going to use this + train window to get the train back on track (hopefully) unless something needs to go for SWAT [19:01:41] thcipriani: Nothing on swat yet, I'd say cancel it and steal the time [19:02:05] * thcipriani nods [19:04:56] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:05:03] !log thcipriani@tin Started scap: SemanticForms l10n cache rebuild for 1.29.0-wmf.7 [19:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:53] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2920441 (10Jdforrester-WMF) [19:10:06] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:16:56] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:22:32] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2920532 (10Mvolz) [19:22:44] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#1526094 (10Mvolz) [19:29:01] !log rolled out exim4 upgrades (DSA 3747-1) on contint1001/2001, dbmonitor1001/2001, tungsten, seaborgium, hassium, labstore* [19:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:46] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:32:56] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:34:40] bblack: yt? 
[19:35:11] !log mira - upgrade exim, python-requests, linux-image [19:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:32] nuria: ? [19:35:37] bblack: did you see the issues with the change to referrer here? https://gerrit.wikimedia.org/r/#/c/255408/2/wmf-config/InitialiseSettings.php listed on this ticket: https://phabricator.wikimedia.org/T148780 [19:36:15] 06Operations, 06Performance-Team, 06Reading-Web-Backlog, 10Traffic, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#2920622 (10MBinder_WMF) [19:38:06] hrm. mw208[0-5].codfw didn't see anything about them in the SAL, but scap is having some trouble connecting to them, known? [19:39:45] thcipriani: only thing I saw was https://tools.wmflabs.org/sal/log/AVlvx3_SlCyyDMEPu_s5 [19:39:59] which includes those you mentioned [19:40:07] so, yeah :) [19:40:10] heh, yeah, that'd explain it [19:40:26] bblack: the internal referrer numbers are off due to browsers not supporting the newer setting. 
[19:41:30] (03CR) 10Chad: [C: 032] Docroots: Remove commons and usability docroots, they use wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321919 (owner: 10Chad) [19:41:51] thcipriani: I'll deploy to prod shortly ^ Gonna let it stew on beta for a few mins first [19:42:06] (03Merged) 10jenkins-bot: Docroots: Remove commons and usability docroots, they use wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321919 (owner: 10Chad) [19:42:18] (03CR) 10jenkins-bot: Docroots: Remove commons and usability docroots, they use wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321919 (owner: 10Chad) [19:42:25] ostriches: oh boy [19:42:35] Eh, puppet change already went out [19:42:39] It should be a no-op [19:42:47] scap is just finishing up [19:42:58] Well I won't pull to tin til you're done anyway [19:43:14] I'll poke you when scap finishes [19:44:11] Holy crap, ./docroot/ almost looks sane now! [19:44:24] !log thcipriani@tin Finished scap: SemanticForms l10n cache rebuild for 1.29.0-wmf.7 (duration: 39m 21s) [19:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:33] ^ ostriches done, FYI [19:44:43] Only non-standard docroots are mw.org, foundationwiki, *.wikimedia.org, wikipedia.org [19:44:53] (plus the non-wiki-related docroots we deal from there) [19:45:39] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2920678 (10BBlack) It's not really my feature, I just happened to write the very short config patch to turn it on, because nobody else had at the... [19:46:27] nuria: ticket updated! but really this isn't my thing. I just wrote the quick patch while we were discussing it in an old ticket. [19:47:01] I think o-when-cross-o makes sense policy-wise, but I couldn't say the meta tag or which spelling is best all things considered... 
[19:47:15] Working fine on mwdebug [19:47:27] nice :) [19:48:07] syncing everywhere now [19:48:29] bblack: ya, my feeling is to let the issue fix itself as browsers update but seems that nobody seems to own this. Will take a second look. [19:49:28] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2920724 (10Andrew) Note to self: Before outage, silence icinga for labservices1001 and checker.tools.wmflabs.org [19:50:18] nuria: if the misspelled version works on all newer browsers as well, we could just go with that. might be informative to see what other major sites are using, too. [19:50:20] Hmmm, one of the proxies is timing out.... [19:50:33] 19:50:23 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'docroot', '--include', 'docroot/***', 'tin.eqiad.wmnet'] on mw2080.codfw.wmnet returned [255]: ssh: connect to host mw2080.codfw.wmnet port 22: Connection timed out [19:50:54] bblack: from looking at old ticket it seems that it triggered an error on console [19:51:05] bblack: so no, i do not think is an option [19:51:24] mw2080 is still being treated as a proxy but it's dead, hmm [19:51:35] ostriches: yeah, evidently 208[0-5].codfw have been decom'd but are still in the pool [19:51:37] (03Abandoned) 10MaxSem: Labs: remove unused wmgCommonsMetadataForceRecalculate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328869 (owner: 10MaxSem) [19:52:23] thcipriani: So, 2080 is still in the proxy list. We could be a little more defensive about it in scap...if a node isn't in the full list it shouldn't be in the proxy or canary list [19:52:27] ostriches: I think since each of the pool of machines try to find their closest proxy and 2080 is dead, the pool will just pick another proxy, should be fine. 
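Note on the 19:52:23 idea (a node that isn't in the full list shouldn't be in the proxy or canary list): a minimal sketch of that check, assuming the host lists are plain lists of hostnames. This is not scap's actual code. The function name and the "mw-example-*" hostnames are hypothetical; only mw2080.codfw.wmnet comes from the log above.

```python
def filter_known_hosts(full_list, subset):
    """Split subset into hosts present in full_list and hosts missing from it."""
    known = set(full_list)
    kept = [host for host in subset if host in known]
    dropped = [host for host in subset if host not in known]
    return kept, dropped


# Hypothetical data: mw2080 is gone from the full list but lingers as a proxy.
all_hosts = ["mw-example-a.codfw.wmnet", "mw-example-b.eqiad.wmnet"]
proxies = ["mw2080.codfw.wmnet", "mw-example-a.codfw.wmnet"]

kept, dropped = filter_known_hosts(all_hosts, proxies)
print("proxies to use:", kept)
print("skipped, not in full list:", dropped)
```

Returning the dropped hosts as well lets the caller log a warning instead of waiting on an ssh timeout for each dead proxy.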
[19:52:37] !log demon@tin Synchronized docroot: Final bit of this round of docroot cleanup (duration: 05m 00s) [19:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:09] (and yeah, those 5 all failed on the final sync) [19:53:41] yarp [19:53:56] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:53:57] this sort of goes along with the canonical target list for scap3 [19:54:32] like ops maintains the master list and we can group 'em however, but if they're not on the master list they aren't tried. [19:54:42] Also, proxy/canary list should come from etcd just like the full list does [19:54:47] So we don't need to maintain two places [19:54:54] this'n https://phabricator.wikimedia.org/T148992 [19:55:35] (we should also get scap speaking etcd directly, rather than relying on puppet to generate rsync lists from etcd) [19:55:56] (03PS2) 10MaxSem: Labs: remove wmgUseGWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328870 [19:55:58] (03PS2) 10MaxSem: Upgrade Collection's license URL to HTTPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329023 [19:57:55] (03PS1) 10Chad: Create wikidata.org docroot (to replace wikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330707 [19:57:58] ostriches: Nice work :) [19:58:46] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:59:54] (03PS1) 10Chad: docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 [20:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170105T2000).
[20:00:35] (03CR) 10Chad: [C: 032] Create wikidata.org docroot (to replace wikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330707 (owner: 10Chad) [20:01:17] (03Merged) 10jenkins-bot: Create wikidata.org docroot (to replace wikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330707 (owner: 10Chad) [20:01:26] (03CR) 10jenkins-bot: Create wikidata.org docroot (to replace wikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330707 (owner: 10Chad) [20:02:06] (03PS1) 10Chad: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330712 [20:02:45] ostriches: lemme know when you're done on tin and I'll go back to train conducting. [20:02:57] Last sync in progress now (gonna be slow, yay timeouts) [20:03:56] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:04:44] Krinkle: I'm curious if https://wikimediafoundation.org/logos/{nupedia,quote,wikibooks,wikipedia,wiktionary}.png are important. [20:04:51] important/used [20:05:35] Also the 4 powerpoints in /presentations/ :p [20:07:27] Also, wikipedia.org/15/homepage.js [20:07:27] lol, nupedia.png ? 
[20:07:31] bawolff: Yes [20:07:33] :P [20:07:46] !log demon@tin Synchronized docroot: Ok last docroot thing for today I promise (duration: 05m 00s) [20:07:46] RECOVERY - Swift HTTP frontend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.175 second response time [20:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:54] thcipriani: Done ^ [20:08:06] ostriches: cool, thanks [20:09:38] ostriches: /15/ sounds like https://15.wikipedia.org/ [20:10:13] but that is https://annual.wikimedia.org/2015/ [20:10:46] PROBLEM - Swift HTTP frontend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused [20:11:17] (03PS1) 10Alex Monk: Remove me from researchers group [puppet] - 10https://gerrit.wikimedia.org/r/330714 (https://phabricator.wikimedia.org/T154696) [20:11:42] mutante: Both of which load a homepage.js, but not from there! [20:11:43] :) [20:11:46] RECOVERY - Swift HTTP frontend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.178 second response time [20:12:16] RECOVERY - Swift HTTP backend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 393 bytes in 0.185 second response time [20:12:22] (03PS1) 10Thcipriani: Revert "Revert "group1 wikis to 1.29.0-wmf.7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330715 [20:14:39] (03PS1) 10Chad: Remove wikipedia.org/15/homepage.js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330716 [20:14:47] (03CR) 10Thcipriani: [C: 032] Revert "Revert "group1 wikis to 1.29.0-wmf.7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330715 (owner: 10Thcipriani) [20:15:07] ostriches: yea, i wonder how /15/ got created in the mw docroot. i was involved in getting that special site up, but files are in a separate repo https://gerrit.wikimedia.org/r/#/q/project:wikimedia/annualreport [20:15:18] Probably an earlier iteration? [20:15:23] Before it ended up on the annual report? 
[20:15:24] yea [20:15:36] i never knew about it in this place [20:15:38] (03Merged) 10jenkins-bot: Revert "Revert "group1 wikis to 1.29.0-wmf.7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330715 (owner: 10Thcipriani) [20:18:07] ostriches: oooh, i think i just remembered something [20:18:09] * thcipriani waits on ssh timeouts [20:18:42] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.7 [20:18:43] ostriches: so when the 15.wp page was brandnew and had the slashdot effect.. and it was hosted on that VM, it got in trouble with all the traffic first [20:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:53] Ahhhh [20:18:55] ostriches: could be that this file was copied here to put the load on the cluster [20:18:58] That would make sense [20:19:29] https://gerrit.wikimedia.org/r/#/c/264302/ [20:19:44] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2920833 (10Nuria) @JKatzWMF Looks like we already discussed on whether to support the missspelled version (ahem... one of them) and consensus wa... [20:20:05] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2920834 (10Nuria) 05Open>03Resolved [20:20:06] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:20:31] (03CR) 10Dzahn: "the 15 page files should all be in the separate wikimedia/annualreport repo and get cloned from there by puppet on the ganeti VM that host" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330716 (owner: 10Chad) [20:21:34] mutante: I'm pretty sure it must've gotten changed back to not load from there anymore (at some point) [20:21:48] I can't find any reference to it being loaded, except from the local annualreport version [20:22:13] (03CR) 10jenkins-bot: Revert "Revert "group1 wikis to 1.29.0-wmf.7"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330715 (owner: 10Thcipriani) [20:22:56] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:23:15] ostriches: yea, i cant imagine there is much traffic for it anymore [20:23:21] I'm going to let 1.29.0-wmf.7 settle on group1 for an hour or so, then I'll move everywhere. [20:23:36] (03CR) 10Chad: [C: 032] Remove wikipedia.org/15/homepage.js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330716 (owner: 10Chad) [20:24:24] (03Merged) 10jenkins-bot: Remove wikipedia.org/15/homepage.js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330716 (owner: 10Chad) [20:25:39] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2920852 (10Nuria) Documented issue in https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly#Changes_and_known_problems_since_2015-06-16 [20:25:47] (03PS1) 10Urbanecm: Add HD logos for multiple projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330719 (https://phabricator.wikimedia.org/T150618) [20:26:06] (03PS2) 10Urbanecm: Add HD logos for multiple projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330719 (https://phabricator.wikimedia.org/T150618) [20:26:38] (03CR) 10jenkins-bot: Remove 
wikipedia.org/15/homepage.js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330716 (owner: 10Chad) [20:27:36] (03PS1) 10Chad: scap: Remove mw2080 as a scap proxy, it's offline/decommed for now [puppet] - 10https://gerrit.wikimedia.org/r/330720 [20:27:51] mutante: Mind looking at 330720? [20:27:58] Scap's a bit slower than it should be right now... [20:28:06] (waiting for a dead node) [20:28:11] bblack: ok, closed ticket since given that discussion had already taken place [20:29:58] !log demon@tin Synchronized docroot/wikipedia.org: removing junk 15 stuff (duration: 04m 50s) [20:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:53] (03CR) 10Dzahn: [C: 032] "ack - {"mw2080.codfw.wmnet": {"pooled": "no"," [puppet] - 10https://gerrit.wikimedia.org/r/330720 (owner: 10Chad) [20:31:56] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:40:16] PROBLEM - Check systemd state on ms-be3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:43:16] PROBLEM - Check systemd state on ms-be3003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:43:26] PROBLEM - Check systemd state on ms-be3004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:45:12] !log iridium (phabricator) - upgrade exim4 packages, force puppet run [20:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:06] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:49:45] ms-be3 is me [20:58:31] !log mendelevium (OTRS) - upgrade exim4 packages, force puppet run [20:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:48] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2920920 (10JKatzWMF) >>! In T148780#2920833, @Nuria wrote: > > Either way there is no perfect solution but we rather not revisit a decision alre... [21:05:15] 06Operations, 10media-storage, 13Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#2920921 (10fgiunchedi) [21:11:11] (03PS5) 10Anomie: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) [21:15:56] 06Operations, 10Analytics, 06Reading-Web-Backlog, 10Traffic: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2920974 (10Nuria) @JKatzWMF: sounds good, as I said on our end there are no changes needed to process the header either way. I just closed ticket... [21:17:37] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#2920977 (10JKatzWMF) [21:18:20] ok, continuing the train to all wikis. 
[21:19:27] (PS1) Thcipriani: all wikis to 1.29.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/330730 [21:19:29] (CR) Thcipriani: [C: 2] all wikis to 1.29.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/330730 (owner: Thcipriani) [21:20:12] (Merged) jenkins-bot: all wikis to 1.29.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/330730 (owner: Thcipriani) [21:20:23] (CR) jenkins-bot: all wikis to 1.29.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/330730 (owner: Thcipriani) [21:22:55] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.7 [21:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:16] RECOVERY - Check systemd state on ms-be3002 is OK: OK - running: The system is fully operational [21:26:17] RECOVERY - Check systemd state on ms-be3003 is OK: OK - running: The system is fully operational [21:26:26] RECOVERY - Check systemd state on ms-be3004 is OK: OK - running: The system is fully operational [21:33:58] (PS3) Tim Landscheidt: Tools: Disable automatic backups of aptly repositories [puppet] - https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726) [21:35:15] !log mx2001 - upgrading exim4 packages, daemon-heavy, forcing puppet run [21:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:02] (PS2) RobH: Remove me from researchers group [puppet] - https://gerrit.wikimedia.org/r/330714 (https://phabricator.wikimedia.org/T154696) (owner: Alex Monk) [21:37:37] (PS2) Tim Landscheidt: docker: Indent @ssl_settings in NGINX configuration [puppet] - https://gerrit.wikimedia.org/r/329735 [21:41:20] (CR) RobH: [C: 2] Remove me from researchers group [puppet] - https://gerrit.wikimedia.org/r/330714 (https://phabricator.wikimedia.org/T154696) (owner: Alex Monk) [21:41:33] (PS3) Aaron Schulz: Add DB "shard" 
column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 [21:42:21] (CR) jerkins-bot: [V: -1] Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz) [21:42:59] (PS4) Aaron Schulz: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 [21:43:33] (CR) jerkins-bot: [V: -1] Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz) [21:44:31] (PS2) Tim Landscheidt: puppetmaster: Specify $group for all repositories [puppet] - https://gerrit.wikimedia.org/r/329595 (https://phabricator.wikimedia.org/T152060) [21:48:02] (PS3) Tim Landscheidt: puppetmaster: Clone repositories in Labs as root [puppet] - https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) [21:53:06] (PS2) Tim Landscheidt: install_server: Indent @ssl_settings in NGINX configuration [puppet] - https://gerrit.wikimedia.org/r/329740 [21:53:08] !log mx1001 - upgrading exim4 packages, exim4-daemon-heavy, forcing puppet run [21:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:40] !log rolling out exim4 upgrades (DSA 3747-1) on all remaining ones in codfw (all-codfw) [21:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:02] (Abandoned) Alex Monk: mediawiki: move redirects site from prod_sites to sites [puppet] - https://gerrit.wikimedia.org/r/322413 (owner: Alex Monk) [22:00:11] (Abandoned) Alex Monk: beta sites: copy zerowiki config from prod [puppet] - https://gerrit.wikimedia.org/r/322416 (owner: Alex Monk) [22:00:25] (Abandoned) Alex Monk: beta: configure loginwiki the same as prod [puppet] - https://gerrit.wikimedia.org/r/322417 (owner: Alex Monk) [22:00:31] (Abandoned) Alex 
Monk: beta: Use wikimedia-common.incl for testwiki [puppet] - https://gerrit.wikimedia.org/r/322418 (owner: Alex Monk) [22:00:38] (Abandoned) Alex Monk: beta: copy metawiki config from prod instead of having our own [puppet] - https://gerrit.wikimedia.org/r/322419 (owner: Alex Monk) [22:00:44] (Abandoned) Alex Monk: beta: bring remnant.conf closer to prod version [puppet] - https://gerrit.wikimedia.org/r/322424 (owner: Alex Monk) [22:02:34] (PS5) Aaron Schulz: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 [22:02:44] :( [22:03:39] oh no, not domas [22:09:36] Operations, Labs, hardware-requests: Site: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#2921133 (chasemp) [22:13:34] Operations, Labs, hardware-requests: Site: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#2921186 (chasemp) [22:13:44] Operations, hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2921187 (chasemp) [22:15:01] Operations, Labs, hardware-requests: Site: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#2921133 (chasemp) Open>stalled [22:22:45] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/CodeReview/api/ApiQueryCodeComments.php: silence some warnings (duration: 02m 46s) [22:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:06] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:31:46] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:35:44] Operations, Citoid, RESTBase, RESTBase-API, and 4 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2921419 (Jdforrester-WMF) [22:40:28] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/VisualEditor: silence some api warnings (duration: 02m 48s) [22:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:43] (Abandoned) Alex Monk: openstack: mitaka files/templates: fix puppet header to give correct path [puppet] - https://gerrit.wikimedia.org/r/311309 (owner: Alex Monk) [22:51:03] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/Echo/includes/api: silence api warnings (duration: 02m 46s) [22:51:06] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [22:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:49] (PS2) Dzahn: Adding install params for elastic2025-elastic2036 Bug:T154251 [puppet] - https://gerrit.wikimedia.org/r/330638 (owner: Papaul) [22:55:06] (CR) Dzahn: [C: 2] Adding install params for elastic2025-elastic2036 Bug:T154251 [puppet] - https://gerrit.wikimedia.org/r/330638 (owner: Papaul) [22:55:37] (PS2) Dzahn: DHCP: Add dhcp entries for elastic2025-elastic2036 Bug:T154251 [puppet] - https://gerrit.wikimedia.org/r/330637 (owner: Papaul) [22:56:06] (PS8) Filippo Giunchedi: Enable Cassandra on restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans) [22:56:49] (CR) Dzahn: [C: 2] DHCP: Add dhcp entries for elastic2025-elastic2036 Bug:T154251 [puppet] - https://gerrit.wikimedia.org/r/330637 (owner: Papaul) [23:01:46] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:05:05] !log temp disabling puppet on 
all eqiad hosts via salt - during ganglia aggregator switch [23:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:17] (PS1) Filippo Giunchedi: puppet_compiler: less verbose compiler-update-facts [puppet] - https://gerrit.wikimedia.org/r/330817 [23:05:36] (PS6) Dzahn: ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) [23:08:19] (CR) Dzahn: [C: 2] ganglia: switch eqiad aggregator from carbon to install1001 [puppet] - https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) (owner: Dzahn) [23:18:44] !log switching eqiad ganglia aggregator - running puppet on install1001 - disabling on carbon, re-enabling puppet across eqiad [23:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:23] mutante: good to go ? [23:23:03] (CR) Filippo Giunchedi: [C: 2] puppet_compiler: less verbose compiler-update-facts [puppet] - https://gerrit.wikimedia.org/r/330817 (owner: Filippo Giunchedi) [23:23:09] (PS2) Filippo Giunchedi: puppet_compiler: less verbose compiler-update-facts [puppet] - https://gerrit.wikimedia.org/r/330817 [23:23:12] (CR) Filippo Giunchedi: [V: 2 C: 2] puppet_compiler: less verbose compiler-update-facts [puppet] - https://gerrit.wikimedia.org/r/330817 (owner: Filippo Giunchedi) [23:23:30] godog: yes [23:23:35] nice, thanks [23:24:06] it turns out that not all salt minions responded [23:24:12] but that's the best i had then [23:24:27] worst case it's just a little gap in the graphs [23:24:50] (PS2) Filippo Giunchedi: grafana: add 503 breakdown to varnish-http-errors [puppet] - https://gerrit.wikimedia.org/r/330460 [23:25:07] (CR) Filippo Giunchedi: [V: 2 C: 2] grafana: add 503 breakdown to varnish-http-errors [puppet] - https://gerrit.wikimedia.org/r/330460 (owner: Filippo Giunchedi) [23:25:19] yeah I 
think that's fine [23:27:43] (CR) Filippo Giunchedi: [C: 2] Enable Cassandra on restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans) [23:27:49] (PS9) Filippo Giunchedi: Enable Cassandra on restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans) [23:27:53] (CR) Filippo Giunchedi: [V: 2 C: 2] Enable Cassandra on restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/328667 (https://phabricator.wikimedia.org/T153880) (owner: Eevans) [23:32:05] (PS1) Filippo Giunchedi: site: add restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/330821 (https://phabricator.wikimedia.org/T153880) [23:32:25] (PS2) Filippo Giunchedi: site: add restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/330821 (https://phabricator.wikimedia.org/T153880) [23:34:05] (CR) Filippo Giunchedi: [C: 2] site: add restbase-dev100[1-3] [puppet] - https://gerrit.wikimedia.org/r/330821 (https://phabricator.wikimedia.org/T153880) (owner: Filippo Giunchedi) [23:37:46] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:38:06] PROBLEM - puppet last run on restbase-test1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:38:08] (CR) Dzahn: "I did that, first disabled puppet across the fleet in eqiad, then ran it first on install1001, then re-enabled it again. there are still a" [puppet] - https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) (owner: Dzahn) [23:38:32] Operations, Analytics, ChangeProp, Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2921703 (bearND) Should also add the new trending-edits service. 
[23:42:20] (CR) Dzahn: [C: 1] Redirect https://toolserver.org/~magnus/ [puppet] - https://gerrit.wikimedia.org/r/330178 (https://phabricator.wikimedia.org/T113696) (owner: DatGuy) [23:45:49] (PS4) Dzahn: Redirect https://toolserver.org/~magnus/ [puppet] - https://gerrit.wikimedia.org/r/330178 (https://phabricator.wikimedia.org/T113696) (owner: DatGuy) [23:46:34] !log fix root-owned files on puppetmaster1001:/var/lib/git/operations/private/ causing /srv/private post-commit hook to fail [23:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:51] (CR) Dzahn: [C: 2] Redirect https://toolserver.org/~magnus/ [puppet] - https://gerrit.wikimedia.org/r/330178 (https://phabricator.wikimedia.org/T113696) (owner: DatGuy) [23:49:18] !log rolling out exim4 upgrades (DSA 3747-1) on all remaining eqiad (all-eqiad) [23:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:00] Operations, ops-eqiad, Labs, Labs-Infrastructure, Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2921778 (Cmjohnson) I purchased some thermal paste today and the entire process should not take longer than 10mins. [23:50:36] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:50:56] PROBLEM - puppet last run on restbase-test1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/etc/cassandra-instances.d] [23:51:16] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:52:16] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/cassandra-instances.d] [23:53:28] hmmm.. 
these probably had puppet disabled [23:53:32] and now it is enabled again [23:53:56] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[exim4-daemon-light],Package[exim4-config] [23:53:57] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[exim4-daemon-light] [23:54:14] yeah possibly also related to the change I merged last, the restbase ones [23:54:16] RECOVERY - puppet last run on restbase-test1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:54:16] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [23:54:36] oh, ok! [23:56:16] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:56:16] RECOVERY - puppet last run on restbase-test1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:58:40] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f2596a07950: Failed to establish a new connection: [Errno 111] Connection refused,)) [23:59:03] yeah that's all expected, I'll silence [23:59:40] PROBLEM - Restbase root url on restbase-dev1001 is CRITICAL: connect to address 10.64.0.35 and port 7231: Connection refused