[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190125T0000). Please do the needful.
[00:00:05] ebernhardson, tgr, Jdlrobson, and nray: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:58] o/
[00:01:06] !log Updated Parsoid to 4772f44 (T214649, T214648)
[00:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:11] T214648: DOMDiff'ing doesn't traverse into galleries - https://phabricator.wikimedia.org/T214648
[00:01:11] T214649: VE's gallery representation differs enough so that selser is never applied? - https://phabricator.wikimedia.org/T214649
[00:01:18] o/
[00:05:56] i suppose i can deploy today,
[00:07:26] (PS2) EBernhardson: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485608 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:09:24] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/485608 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:09:37] (PS2) EBernhardson: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485609 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:10:38] (Merged) jenkins-bot: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485608 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:12:56] (CR) Volans: [C: +2] "LGTM" [software/netbox] - https://gerrit.wikimedia.org/r/486399 (owner: CRusnov)
[00:13:54] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/485609 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:14:12] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T212788 gerrit:485608: autocomplete subphrase matching on wikitech and mw.org (duration: 00m 46s)
[00:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:15] T212788: Use subphrase matching for autocomplete by default on specific sites - https://phabricator.wikimedia.org/T212788
[00:14:37] (CR) CRusnov: [V: +2] Nudge requirements to Django 2.1.5 [software/netbox] - https://gerrit.wikimedia.org/r/486399 (owner: CRusnov)
[00:15:01] (Merged) jenkins-bot: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485609 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:15:39] (CR) Volans: [C: +2] tests: test also with Python 3.7 [software/cumin] - https://gerrit.wikimedia.org/r/481914 (owner: Volans)
[00:17:00] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/486306 (owner: Jbond)
[00:20:36] (CR) Volans: [C: +1] "LGTM, is WMCS also free of uwsgi services on trusty?" [puppet] - https://gerrit.wikimedia.org/r/486223 (owner: Muehlenhoff)
[00:20:38] !log ebernhardson@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: SWAT T212788 gerrit:485609: autocomplete subphrase matching on wikitech and mw.org 2 of 2 (duration: 00m 45s)
[00:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:41] T212788: Use subphrase matching for autocomplete by default on specific sites - https://phabricator.wikimedia.org/T212788
[00:21:12] (CR) jenkins-bot: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485608 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:21:25] (CR) jenkins-bot: [cirrus] autocomplete: enable subphrase matching for wikitech and mw.org (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/485609 (https://phabricator.wikimedia.org/T212788) (owner: DCausse)
[00:26:29] tgr: nray: Code is up on mwdebug1001
[00:26:45] thank you will check now...
[00:26:51] (Merged) jenkins-bot: tests: test also with Python 3.7 [software/cumin] - https://gerrit.wikimedia.org/r/481914 (owner: Volans)
[00:27:29] ebernhardson: it's a DB replication timing bug, I'll check it in the logs when it's live
[00:27:56] tgr: ok
[00:28:02] (CR) jenkins-bot: tests: test also with Python 3.7 [software/cumin] - https://gerrit.wikimedia.org/r/481914 (owner: Volans)
[00:29:24] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.14/includes/Title.php: SWAT T210739 gerrit:486369: Clone the Title object to prevent mutation (duration: 00m 47s)
[00:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:28] T210739: Target deletion during page move fails - https://phabricator.wikimedia.org/T210739
[00:29:38] tgr: synced out
[00:29:49] thanks!
[00:29:55] (PS5) EBernhardson: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T214515)
[00:30:02] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T214515) (owner: EBernhardson)
[00:30:13] (PS2) EBernhardson: Turn on wbsearchentities ab test in de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/486154 (https://phabricator.wikimedia.org/T214515)
[00:31:10] (Merged) jenkins-bot: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T214515) (owner: EBernhardson)
[00:32:12] ok, checked and my changes look good!
[00:32:18] ready for deploy
[00:33:22] nray: ok shipping
[00:34:04] ebernhardson: thanks so much!
[00:34:05] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/MobileFrontend/: SWAT T214606 gerrit:486392: MobileFrontend if wikidatadata description exists, set it as tagline (duration: 00m 47s)
[00:34:05] (PS1) CRusnov: Upgrade Netbox to 2.5.3 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/486410
[00:34:10] (CR) jenkins-bot: Add wbsearchentities profiles for de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/484334 (https://phabricator.wikimedia.org/T214515) (owner: EBernhardson)
[00:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:43] T214606: Regression: Wikidata descriptions are always printed as "1" for tagline - https://phabricator.wikimedia.org/T214606
[00:37:57] !log ebernhardson@deploy1001 Synchronized wmf-config/WikibaseSearchSettings.php: SWAT T214515 gerrit:484334: Add wbsearchentities profiles for de, fr, es (duration: 00m 45s)
[00:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:01] T214515: Run wikidata entitiy autocomplete AB test in de, fr, es - https://phabricator.wikimedia.org/T214515
[00:38:05] (CR) EBernhardson: [C: +2] Turn on wbsearchentities ab test in de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/486154 (https://phabricator.wikimedia.org/T214515) (owner: EBernhardson)
[00:39:20] (Merged) jenkins-bot: Turn on wbsearchentities ab test in de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/486154 (https://phabricator.wikimedia.org/T214515) (owner: EBernhardson)
[00:41:52] (CR) Gergő Tisza: "Should probably make those changes on testcommonswiki as well." [mediawiki-config] - https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: Zoranzoki21)
[00:46:27] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T214515 gerrit:486154: Turn on wbsearchentities ab test in de, fr, es (duration: 00m 46s)
[00:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:31] T214515: Run wikidata entitiy autocomplete AB test in de, fr, es - https://phabricator.wikimedia.org/T214515
[00:47:04] (PS1) BryanDavis: toolforge: install libexiv2-dev on grid nodes [puppet] - https://gerrit.wikimedia.org/r/486413 (https://phabricator.wikimedia.org/T213965)
[00:47:14] (CR) jenkins-bot: Turn on wbsearchentities ab test in de, fr, es [mediawiki-config] - https://gerrit.wikimedia.org/r/486154 (https://phabricator.wikimedia.org/T214515) (owner: EBernhardson)
[00:48:15] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[00:48:17] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[00:48:41] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[00:48:49] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[00:48:59] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[00:49:13] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[00:49:56] (CR) Bstorm: [C: +2] toolforge: install libexiv2-dev on grid nodes [puppet] - https://gerrit.wikimedia.org/r/486413 (https://phabricator.wikimedia.org/T213965) (owner: BryanDavis)
[00:59:42] Operations, wikitech.wikimedia.org: wikitech-static cert about to expire - https://phabricator.wikimedia.org/T214640 (Dzahn) Yes, we should not have to do anything. And even if it would fail auto-renew for some reason the maximum should be to run `certbot renew` now :)
[01:06:34] (CR) Volans: [C: +1] "LGTM" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/486410 (owner: CRusnov)
[01:12:52] (CR) CRusnov: [V: +2 C: +2] Upgrade Netbox to 2.5.3 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/486410 (owner: CRusnov)
[01:13:40] (CR) Dzahn: [C: +2] DNS: Add mgmt DNS entries for cloudcontrol2001-dev and cloudvirt200[123]-dev [dns] - https://gerrit.wikimedia.org/r/486391 (https://phabricator.wikimedia.org/T214448) (owner: Papaul)
[01:18:13] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[01:18:24] !log crusnov@deploy1001 Started deploy [netbox/deploy@7770453]: Upgrade netbox to 2.5.3 - T212524
[01:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:28] T212524: Upgrade Netbox to 2.5.x - https://phabricator.wikimedia.org/T212524
[01:22:57] PROBLEM - IPMI Sensor Status on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[01:26:08] !log crusnov@deploy1001 Finished deploy [netbox/deploy@7770453]: Upgrade netbox to 2.5.3 - T212524 (duration: 07m 43s)
[01:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:11] T212524: Upgrade Netbox to 2.5.x - https://phabricator.wikimedia.org/T212524
[01:27:33] !log crusnov@deploy1001 Started deploy [netbox/deploy@7770453]: Upgrade netbox to 2.5.3 - T212524 Try 2
[01:27:35] (PS3) Dzahn: testreduce: pin npm to stretch-backports and use install_options [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366)
[01:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:04] !log crusnov@deploy1001 Finished deploy [netbox/deploy@7770453]: Upgrade netbox to 2.5.3 - T212524 Try 2 (duration: 00m 31s)
[01:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:14] !log crusnov@deploy1001 Started deploy [netbox/deploy@7770453]: Cleanup deploy - T212524
[01:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:18] T212524: Upgrade Netbox to 2.5.x - https://phabricator.wikimedia.org/T212524
[01:33:25] !log crusnov@deploy1001 Finished deploy [netbox/deploy@7770453]: Cleanup deploy - T212524 (duration: 00m 11s)
[01:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:47] Operations, Analytics, Research, Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Dzahn) >>! In T213566#4898998, @bmansurov wrote: > @Dzahn please feel free to invite a senior SRE to the discussion. Hi @...
[01:40:52] (PS4) Dzahn: testreduce: pin npm to backports, use install_options, fix dependencies [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366)
[01:43:35] (PS5) Dzahn: testreduce: pin npm to backports, use install_options, fix dependencies [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366)
[01:44:36] (PS6) Dzahn: testreduce: pin npm to backports, use install_options, fix dependencies [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366)
[01:51:50] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[01:54:32] (CR) Dzahn: [C: -1] "sigh.. Duplicate declaration: Package[nodejs] is already declared in file /srv/jenkins-workspace/puppet-compiler/14479/change/src/modules/" [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[01:59:01] (CR) Dzahn: [C: -1] "the visualdiff module says it provides a "stand-alone visual diffing service" so it requires the nodejs package. but on the parsoid-test h" [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[02:03:51] Operations, Proton, Reading-Infrastructure-Team-Backlog, Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (Tgr) @ema can you help out with the Varnish questions? * Is my understanding correct that by default Varni...
[02:15:53] PROBLEM - HP RAID on db2068 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:7 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK
[02:15:55] ACKNOWLEDGEMENT - HP RAID on db2068 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:7 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214663
[02:16:05] Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T214663 (ops-monitoring-bot)
[02:17:48] (PS17) Jbond: Add apt pinning for buster [puppet] - https://gerrit.wikimedia.org/r/486306
[02:22:39] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 169.39 seconds
[02:22:39] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Jdforrester-WMF)
[02:25:02] (PS1) BryanDavis: Refactor and simplify python package [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878)
[02:25:54] (CR) jerkins-bot: [V: -1] Refactor and simplify python package [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: BryanDavis)
[02:32:11] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19050864 and 0 seconds
[02:39:39] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 54 seconds
[02:43:41] PROBLEM - Device not healthy -SMART- on db2068 is CRITICAL: cluster=mysql device=cciss,11 instance=db2068:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2068&var-datasource=codfw+prometheus/ops
[02:44:42] Operations, Packaging: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (kchapman) @MoritzMuehlenhoff is this task not relevant anymore? I'm getting a "Not Found" for the Gerrit repo.
[02:44:59] Operations, Packaging, Performance-Team: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (kchapman) a:Imarlier→None
[02:46:44] Operations, Packaging: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729 (kchapman) Going through Ian's old tasks. Closing as there hasn't been activity in almost a year.
[02:47:16] Operations, Packaging, Performance-Team: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729 (kchapman) Open→Stalled
[02:48:17] Operations, Packaging, Performance-Team: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729 (kchapman) a:Imarlier→None
[02:52:11] (PS1) Dzahn: testreduce: pin npm to stretch-backports [puppet] - https://gerrit.wikimedia.org/r/486420 (https://phabricator.wikimedia.org/T201366)
[02:53:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:34] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14480/" [puppet] - https://gerrit.wikimedia.org/r/486420 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[03:00:31] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:03:41] RECOVERY - MegaRAID on helium is OK: OK: optimal, 1 logical, 12 physical
[03:03:53] !log scandium - apt-get -t stretch-backports install npm ; run puppet ; remove manually created /apt/preferences.d/npm.pref ; puppet created npm_stretch_backports.pref ; puppet run without errors again (T201366)
[03:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:56] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366
[03:05:47] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:07:15] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) Now we have this puppetized APT pinning setup: ` Pinned packages: nodejs -> 10.4.0~dfsg-1+wmf2 with priority 1005 nodejs -> 6.11.0~dfsg-1+w...
[03:08:02] (CR) Volans: [C: +1] "LGTM, thanks for doing also a bit of cleanup. See the check experimental for compiler results." [puppet] - https://gerrit.wikimedia.org/r/486306 (owner: Jbond)
[03:08:28] (CR) Dzahn: [C: -1] "partially done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/486420 and the rest needs rebase now" [puppet] - https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn)
[03:11:36] Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (faidon) Netbox is now at 2.5 \o/ which allows us to import cable IDs, type, color etc. Let's start with importing eqsin's, with the data that we have in [[ https://docs.google.com/spreadsheets/d/1FKYVQJePjTQ7nVwYv4oDC6Gszk...
[03:12:43] !log scandium sudo chgrp -R wikidev /srv/deployment/parsoid/deploy/ ; sudo chmod -R g+w /srv/deployment/parsoid/deploy/ (T201366)
[03:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:46] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366
[03:15:14] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) @ssastry nodejs 10 and npm are installed, the puppet run is not broken and i changed the ownership of the parsoid deployment files. There are 2 pendin...
[03:22:57] (PS1) MaxSem: Remove explicit right grants for group 'confirmed' [mediawiki-config] - https://gerrit.wikimedia.org/r/486422 (https://phabricator.wikimedia.org/T214655)
[03:25:30] (PS1) Dzahn: varnish/trafficserver: switch parsoid-tests backend, rename director [puppet] - https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366)
[03:28:54] (PS2) Dzahn: varnish/trafficserver: switch parsoid-tests backend, rename director [puppet] - https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366)
[03:29:07] in search of someone who can help with lists.wikimedia.org issue
[03:38:17] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (Dzahn) >>! In T201366#4904045, @ssastry wrote: > * on ruthenium, we've "sudo chgrp -R wikidev" and "sudo chmod -R g+w" all the code in /srv/deployment/parsoid...
[04:08:52] Operations, Cloud-Services, Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (Papaul)
[04:14:06] Operations, serviceops, Patch-For-Review: "sql" command fails with "sh: 1: mysql: not found" on mwdebug1002 - https://phabricator.wikimedia.org/T211512 (Dzahn) This has been brought this up in our meeting today by @jijiki what do you think of https://gerrit.wikimedia.org/r/c/operations/puppet/+/47914...
[04:19:37] PROBLEM - puppet last run on analytics1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:31:07] (CR) BryanDavis: "Jenkins failure is:" [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: BryanDavis)
[04:39:02] (PS2) BryanDavis: Refactor and simplify python package [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878)
[04:39:34] (CR) jerkins-bot: [V: -1] Refactor and simplify python package [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: BryanDavis)
[04:43:51] PROBLEM - Long running screen/tmux on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[04:45:45] RECOVERY - puppet last run on analytics1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:51:07] (CR) BryanDavis: "real error is:" [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: BryanDavis)
[05:29:53] (PS3) BryanDavis: Refactor and simplify python package [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878)
[05:38:07] !log kartik@deploy1001 Started deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275)
[05:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:12] T213275: CX2: Should not transform named HTML entites into numeric HTML entities - https://phabricator.wikimedia.org/T213275
[05:38:13] T213257: CX2: Should not use for regular white-space characters - https://phabricator.wikimedia.org/T213257
[05:42:16] !log kartik@deploy1001 Finished deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275) (duration: 04m 09s)
[05:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:39] (CR) Sayant Mahato: "Required change has been done. Please create that namespace on sa.wikisource. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[05:52:52] (PS6) Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553)
[05:56:26] (CR) Ammarpad: "> Colon removing will also be needed in commit message." [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[05:57:22] (CR) BryanDavis: [C: +1] "> The only way I can think of avoiding this is" [puppet] - https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: Herron)
[05:58:44] (CR) Ammarpad: "> Required change has been done. Please create that namespace on" [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[05:59:51] (PS5) Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364)
[06:00:03] (CR) jerkins-bot: [V: -1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[06:02:33] (PS6) Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364)
[06:02:45] (CR) jerkins-bot: [V: -1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: Ammarpad)
[06:04:06] !log Compress dbstore1002: staging.mep_word_persistence from Aria to InnoDB - T213706
[06:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:09] T213706: Convert Aria/Tokudb tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706
[06:04:40] (CR) Sayant Mahato: "OK. "लेखकः", here it not colon but the part of the word. https://en.wikipedia.org/wiki/Visarga" [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[06:06:20] Operations, ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T214663 (Marostegui) p:Triage→Normal a:Papaul @Papaul let's get it replaced - thanks!
[06:06:54] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2068 is CRITICAL: cluster=mysql device=cciss,11 instance=db2068:9100 job=node site=codfw Marostegui T214663 - The acknowledgement expires at: 2019-02-04 06:06:34. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2068&var-datasource=codfw+prometheus/ops
[06:08:51] (PS1) Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/486424 (https://phabricator.wikimedia.org/T210713)
[06:10:20] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/486424 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:11:23] (Merged) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/486424 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:11:40] (CR) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/486424 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:12:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 T210713 (duration: 00m 48s)
[06:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:50] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[06:13:02] !log Deploy schema change on db1122 - T210713
[06:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:18] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/486426
[06:24:00] (CR) Marostegui: transfer.py: Add the ability to transfer from a new mariabackup (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/486264 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[06:27:55] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:32:37] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:33:07] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:39:03] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:46:32] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/486426 (owner: Marostegui)
[06:47:36] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/486426 (owner: Marostegui)
[06:49:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122 T210713 (duration: 00m 47s)
[06:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:04] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[06:49:33] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/486426 (owner: Marostegui)
[06:50:31] (PS1) Marostegui: db-eqiad.php: Fully depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/486427 (https://phabricator.wikimedia.org/T210713)
[06:51:32] (CR) Marostegui: [C: +2] db-eqiad.php: Fully depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/486427 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:52:35] (Merged) jenkins-bot: db-eqiad.php: Fully depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/486427 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:53:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully depool db1105 (duration: 00m 46s)
[06:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:47] !log Stop MySQL on db1105 to upgrade MySQL
[06:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:43] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:17] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:02:47] (CR) jenkins-bot: db-eqiad.php: Fully depool db1105 [mediawiki-config] - https://gerrit.wikimedia.org/r/486427 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:15:00] (PS1) Marostegui: db-eqiad.php: Slowly repool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/486428
[07:18:17] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/486428 (owner: Marostegui)
[07:19:22] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1105:3311 [mediawiki-config] - https://gerrit.wikimedia.org/r/486428 (owner: Marostegui)
[07:21:23] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[07:21:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1105 (duration: 00m 47s)
[07:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:39] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[07:21:41] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:21:41] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[07:22:05] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:22:06] restarted the nagios daemon --^
[07:22:13] RECOVERY - Disk space on notebook1003 is OK: DISK OK
[07:22:23] RECOVERY - DPKG on notebook1003 is OK: All packages OK
[07:25:39] RECOVERY - IPMI Sensor Status on notebook1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[07:25:52] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486430 [07:26:18] 10Operations, 10Packaging, 10Performance-Team: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741 (10elukey) @kchapman I don't think that we have ever used it, IIRC Ian wanted to import the package but he never used it anywhere (but I could be wrong!). Is the package needed?... [07:27:37] (03PS2) 10Marostegui: db-eqiad.php: Slowly repool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486430 [07:28:41] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486430 (owner: 10Marostegui) [07:28:55] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486428 (owner: 10Marostegui) [07:29:45] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486430 (owner: 10Marostegui) [07:30:00] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486430 (owner: 10Marostegui) [07:30:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1105:3312 (duration: 00m 45s) [07:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:48] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486431 [07:32:43] 10Operations, 10PHP 7.0 support: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jcrespo) Check T214248 - this is a known issue (unrelated to php7) but it is not considered a bug, just a configuration weakness (https://dev.mysql.com/doc/refman/8.0/en/... 
[07:35:56] (03CR) 10Jcrespo: transfer.py: Add the ability to transfer from a new mariabackup (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/486264 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [07:37:33] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 52 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [07:38:52] (03PS1) 10Elukey: Remove hiera overrides for analytics1054 after disk swap [puppet] - 10https://gerrit.wikimedia.org/r/486433 (https://phabricator.wikimedia.org/T213038) [07:40:38] !log drain + reboot analytics1054 after disk swap (verify reboot + restore correct fstab mountpoints) - T213038 [07:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:44] T213038: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 [07:40:46] (03CR) 10Elukey: [C: 03+2] Remove hiera overrides for analytics1054 after disk swap [puppet] - 10https://gerrit.wikimedia.org/r/486433 (https://phabricator.wikimedia.org/T213038) (owner: 10Elukey) [07:47:50] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10elukey) 05Open→03Resolved All good thanks a lot @Cmjohnson ! [07:48:53] (03PS1) 10Elukey: Remove hiera host overrides for analytics1056 after disk swap [puppet] - 10https://gerrit.wikimedia.org/r/486434 (https://phabricator.wikimedia.org/T214057) [07:49:45] (03CR) 10Elukey: [C: 03+2] Remove hiera host overrides for analytics1056 after disk swap [puppet] - 10https://gerrit.wikimedia.org/r/486434 (https://phabricator.wikimedia.org/T214057) (owner: 10Elukey) [07:51:11] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Fri 2019-01-25 07:51:09 UTC. 
[07:51:40] !log restart yarn/hdfs daemons on analytics1056 to pick up new disk settings - T214057 [07:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:43] T214057: Broken disk on analytics1056 - https://phabricator.wikimedia.org/T214057 [07:53:11] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:54:02] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Broken disk on analytics1056 - https://phabricator.wikimedia.org/T214057 (10elukey) 05Open→03Resolved all good thanks @Cmjohnson ! [07:58:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486431 (owner: 10Marostegui) [07:59:36] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486431 (owner: 10Marostegui) [08:00:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give more traffic to db1105:3312 (duration: 00m 45s) [08:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:46] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486431 (owner: 10Marostegui) [08:08:20] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486435 [08:21:55] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486435 (owner: 10Marostegui) [08:23:00] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486435 (owner: 10Marostegui) [08:24:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1105 (duration: 00m 45s) [08:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:02] (03CR) 10jenkins-bot: 
db-eqiad.php: Fully repool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486435 (owner: 10Marostegui) [08:36:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] uwsgi: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/486223 (owner: 10Muehlenhoff) [08:41:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:44:05] RECOVERY - Long running screen/tmux on notebook1003 is OK: OK: No SCREEN or tmux processes detected. [08:53:47] (03CR) 10Ammarpad: ">" (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [09:01:10] 10Operations, 10PHP 7.0 support: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) After some digging: - HHVM has no `error_reporting` INI value set, meaning we use the default value, which is, according to my tests, `16807935`. Please note this va... [09:05:22] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) >>! In T203786#49068... [09:10:53] (03PS1) 10Gehel: maps: re-enable OSM lag check [puppet] - 10https://gerrit.wikimedia.org/r/486436 (https://phabricator.wikimedia.org/T198622) [09:21:58] (03CR) 10Ammarpad: "> OK. "लेखकः", here it not colon but the part of the word." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [09:25:18] (03CR) 10星耀晨曦: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [09:25:29] (03CR) 10jerkins-bot: [V: 04-1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [09:39:25] (03CR) 10MarcoAurelio: "You need to rebase this patch. See . Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [09:43:05] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 19119.99 seconds [09:48:27] !log Add dbstore1005:3318 to tendril T210478 [09:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:30] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [09:52:26] (03PS3) 10Zoranzoki21: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) [09:52:56] 10Operations, 10Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (10Gehel) [09:53:35] (03CR) 10Zoranzoki21: "> Should probably make those changes on testcommonswiki as well." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: 10Zoranzoki21) [09:57:34] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (10fgiunchedi) Expanding from the graph above with this expression: `sum by (plugin_id) (rate(logstash_node_plugin_events_out_total{plugin_id=... [09:59:29] 10Operations, 10MediaWiki Language Extension Bundle, 10MediaWiki-Cache, 10Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Today I reviewed one... [10:01:07] (03PS4) 10Zoranzoki21: Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) [10:01:37] (03CR) 10Zoranzoki21: "Everything is fixed, should be ok now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: 10Zoranzoki21) [10:03:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/486306 (owner: 10Jbond) [10:04:59] (03CR) 10Mathew.onipe: [C: 03+1] maps: re-enable OSM lag check [puppet] - 10https://gerrit.wikimedia.org/r/486436 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [10:26:26] (03CR) 10Gehel: [C: 03+2] maps: re-enable OSM lag check [puppet] - 10https://gerrit.wikimedia.org/r/486436 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [10:34:19] (03CR) 10Hashar: "It is magic! 
Well done :)" [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [10:38:59] 10Operations, 10PHP 7.0 support: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Confirmed that's the case, by running this code in CLI and in a browser `lang=php 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar): Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) Status update: 4 services (swift / ores / thumbor / logstash) have their metrics collected by Prometheus by v... [11:10:39] (03PS1) 10Muehlenhoff: strongswan: Stop supporting trusty [puppet] - 10https://gerrit.wikimedia.org/r/486442 [11:13:08] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/14481/" [puppet] - 10https://gerrit.wikimedia.org/r/486442 (owner: 10Muehlenhoff) [11:15:06] PROBLEM - Apache HTTP on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [11:16:18] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.072 second response time [11:18:15] (03PS1) 10Muehlenhoff: hhvm: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/486443 [11:29:30] (03PS7) 10Ammarpad: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) [11:29:42] (03CR) 10jerkins-bot: [V: 04-1] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [11:33:06] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) [11:40:29] 10Operations, 10Citoid, 
10Prod-Kubernetes, 10Core Platform Team Backlog (Watching / External), and 2 others: Citoid automated monitoring times out due to Zotero v2 - https://phabricator.wikimedia.org/T211411 (10Mvolz) How are the timeouts looking since the redeploy on 2019-01-17? I'm having trouble interpr... [11:41:53] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/14482/" [puppet] - 10https://gerrit.wikimedia.org/r/486443 (owner: 10Muehlenhoff) [11:46:41] (03PS1) 10Muehlenhoff: keyholder: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/486446 [11:47:42] (03PS1) 10Muehlenhoff: Remove unused statsite::decommission class [puppet] - 10https://gerrit.wikimedia.org/r/486447 [11:48:19] (03CR) 10MarcoAurelio: [C: 04-1] "The manual rebase performed here is faulty. I suggest you abandon this change and upload a new patch set with an up-to-date version of the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [11:54:52] (03PS1) 10Muehlenhoff: hhvm: Remove support for pre stretch [puppet] - 10https://gerrit.wikimedia.org/r/486449 [11:58:07] (03CR) 10Arturo Borrero Gonzalez: "You are probably looking for profile::openstack::base::clientpackages" [puppet] - 10https://gerrit.wikimedia.org/r/486322 (owner: 10Andrew Bogott) [11:58:35] (03CR) 10Arturo Borrero Gonzalez: "> You are probably looking for profile::openstack::base::clientpackages" [puppet] - 10https://gerrit.wikimedia.org/r/486322 (owner: 10Andrew Bogott) [11:59:55] (03PS1) 10Muehlenhoff: striker: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/486453 [12:04:44] (03PS18) 10Jbond: Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486306 [12:05:53] (03PS1) 10KartikMistry: WIP: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 [12:06:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: Cron to run script to purge old CX drafts [puppet] - 
10https://gerrit.wikimedia.org/r/486454 (owner: 10KartikMistry) [12:06:29] (03CR) 10Jbond: [C: 03+2] Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486306 (owner: 10Jbond) [12:08:35] (03PS1) 10Muehlenhoff: lxc: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/486455 [12:11:06] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:32] PROBLEM - puppet last run on cloudvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:34] PROBLEM - puppet last run on elastic2043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:46] PROBLEM - puppet last run on mc2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:46] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:13:22] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:13:22] PROBLEM - puppet last run on auth2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:13:30] PROBLEM - puppet last run on mc2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:13:36] PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:13:40] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:14:24] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:14:24] PROBLEM - puppet last run on mc2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:14:24] PROBLEM - puppet last run on labtestservices2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:14:35] (03PS2) 10KartikMistry: WIP: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 [12:14:38] PROBLEM - puppet last run on restbase2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:14:44] PROBLEM - puppet last run on poolcounter1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:15:12] PROBLEM - puppet last run on poolcounter2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:15:28] PROBLEM - puppet last run on thumbor1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:15:50] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:15:54] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:00] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:06] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:16] PROBLEM - puppet last run on mc2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:16] PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:16:16] PROBLEM - puppet last run on thumbor2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:24] PROBLEM - puppet last run on roentgenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:24] PROBLEM - puppet last run on etcd1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:24] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:26] PROBLEM - puppet last run on netmon1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:34] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:48] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:58] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:58] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:58] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:00] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:06] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:06] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:17:20] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:23] (03PS3) 10KartikMistry: WIP: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [12:17:27] network or puppetmaster, maybe? [12:17:30] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:32] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:34] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:55] or maybe just a deploy [12:18:08] PROBLEM - puppet last run on oresrdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:10] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:14] (03CR) 10jerkins-bot: [V: 04-1] WIP: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [12:18:20] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:24] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:32] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:32] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:18:38] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:50] PROBLEM - puppet last run on mc2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:56] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:18:56] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:18] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:18] Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Invalid relationship: Apt::Pin[puppet-all] { before => Package[puppet-common] }, because Package[puppet-common] doesn't seem to be in the catalog [12:19:22] ^jbond [12:19:25] jbond42: can you revert [12:19:34] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:36] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:36] PROBLEM - puppet last run on mc2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:40] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:40] PROBLEM - puppet last run on dumpsdata1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:40] PROBLEM - puppet last run on logstash1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:19:40] PROBLEM - puppet last run on cloudvirtan1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:50] PROBLEM - puppet last run on restbase2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:54] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:56] PROBLEM - puppet last run on mc2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:20:13] (03PS4) 10KartikMistry: WIP: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [12:21:08] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:22] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:28] PROBLEM - puppet last run on mc2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:31] (03PS1) 10Jcrespo: Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/486459 [12:21:36] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:36] PROBLEM - puppet last run on cloudvirt1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:50] ^ moritzm [12:21:58] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:58] PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:03] ack! 
[12:22:08] PROBLEM - puppet last run on cloudvirt1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:12] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:13] will not do it without your +1 [12:22:18] PROBLEM - puppet last run on poolcounter1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:24] as it is a blind revert [12:22:30] (03CR) 10Muehlenhoff: [C: 03+1] Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/486459 (owner: 10Jcrespo) [12:22:34] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:40] (03CR) 10Jcrespo: [C: 03+2] Revert "Add apt pinning for buster" [puppet] - 10https://gerrit.wikimedia.org/r/486459 (owner: 10Jcrespo) [12:22:44] PROBLEM - puppet last run on cloudvirt1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:47] fine to revert, the patch has a bug not caught by PCC [12:23:06] yeah, but as I am not in the loop [12:23:12] PROBLEM - puppet last run on mc2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:12] PROBLEM - puppet last run on kubetcd2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:15] sometimes it can cause worse issues [12:23:15] ack, the dependency is declared on puppet-common, but we don't declare that package in our manifests [12:23:20] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:24] PROBLEM - puppet last run on cloudservices1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:23:24] PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:24] PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:44] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:44] (03PS1) 10Jcrespo: Revert "Revert "Add apt pinning for buster"" [puppet] - 10https://gerrit.wikimedia.org/r/486461 [12:23:52] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:54] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:54] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:23:54] PROBLEM - puppet last run on kubetcd2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:04] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:04] (03CR) 10Jcrespo: "^fix goes here :-)" [puppet] - 10https://gerrit.wikimedia.org/r/486461 (owner: 10Jcrespo) [12:24:10] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:26] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:30] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:24:52] RECOVERY - puppet last run on mc2036 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:25:06] PROBLEM - puppet last run on thumbor1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:25:06] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:25:50] PROBLEM - puppet last run on mc2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:00] puppet runs are fine again, I'm triggering some puppet runs on failed systems via cumin [12:26:20] not worried about the fail much [12:26:20] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:22] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:26] PROBLEM - puppet last run on tureis is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:33] more like someone being around to notice it :-) [12:26:40] PROBLEM - puppet last run on restbase2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:40] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:48] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:27:09] yep, still clearing icinga to make it useful again :-) [12:27:09] when I do a change that affects all hosts, I almost always prepare the revert in advance [12:27:12] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:27:26] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:27:34] and run it quickly on some hosts to check breakage, revert quickly [12:27:36] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:27:39] yeah, ideally gerrit would have a one-button-revert feature for emergencies [12:27:50] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:27:54] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:28:32] PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:28:40] RECOVERY - puppet last run on restbase2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:28:48] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:28:50] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:28:52] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:06] PROBLEM - puppet last run on mc1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:29:10] RECOVERY - puppet last run on kubetcd2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:29:18] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:29:22] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:29:26] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:32] PROBLEM - puppet last run on cloudvirtan1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:42] PROBLEM - puppet last run on poolcounter2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:04] RECOVERY - puppet last run on restbase2012 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:30:06] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:16] RECOVERY - puppet last run on restbase2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:30:18] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:30:18] RECOVERY - puppet last run on thumbor1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:20] RECOVERY - puppet last run on restbase2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:30:20] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:20] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:30:22] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:31:08] RECOVERY - puppet last run on thumbor1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:31:36] RECOVERY - puppet last run on mc1032 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:31:36] RECOVERY - puppet last run on mc1027 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:31:56] RECOVERY - puppet last run on restbase2010 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:32:00] RECOVERY - puppet last run on thumbor2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:32:04] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:06] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:08] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:16] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:32:42] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:42] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:32:44] RECOVERY - puppet last run on mc1025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:10] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:33:48] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures 
[12:34:12] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:14] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:20] RECOVERY - puppet last run on mc1024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:34:56] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:35:14] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:37:18] RECOVERY - puppet last run on cloudvirt1018 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:37:44] RECOVERY - puppet last run on cloudvirt1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:37:50] RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:38:26] RECOVERY - puppet last run on cloudvirt1016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:38:44] RECOVERY - puppet last run on elastic2043 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:38:52] RECOVERY - puppet last run on mc2027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:38:54] RECOVERY - puppet last run on mc2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:38:54] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:24] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:39:36] RECOVERY - puppet last run on mc2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:44] RECOVERY - puppet last run on mc2032 is 
OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:50] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:39:58] RECOVERY - puppet last run on cloudvirtan1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:40:28] RECOVERY - puppet last run on mc2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:40:30] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:40:32] RECOVERY - puppet last run on cloudvirtan1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:40:44] (03PS1) 10Jbond: Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486464 [12:40:48] RECOVERY - puppet last run on mc2025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:40:52] RECOVERY - puppet last run on poolcounter1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:41:16] RECOVERY - puppet last run on poolcounter2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:41:32] RECOVERY - puppet last run on mc2033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:42:00] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:42:03] (03CR) 10MarcoAurelio: [C: 04-1] Add 'Author' namespace in Sanskrit Wikisource (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [12:42:08] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:42:16] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[12:42:24] RECOVERY - puppet last run on mc2024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:42:28] RECOVERY - puppet last run on mc2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:42:28] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:42:30] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:42:34] RECOVERY - puppet last run on etcd1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:42:48] (03CR) 10MarcoAurelio: [C: 04-1] Add 'Author' namespace in Sanskrit Wikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [12:42:54] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:42:56] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:43:10] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:43:14] RECOVERY - puppet last run on poolcounter1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:43:32] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:43:40] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:43:40] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:43:44] RECOVERY - puppet last run on hassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:44:18] RECOVERY - puppet last run on oresrdb2002 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [12:44:20] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:44:28] RECOVERY - puppet last run on logstash1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:44:32] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:44:40] RECOVERY - puppet last run on auth2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:44:46] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:44:50] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:44:58] RECOVERY - puppet last run on alcyone is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:44:58] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:45:04] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:45:20] RECOVERY - puppet last run on poolcounter2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:45:42] RECOVERY - puppet last run on labtestservices2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:45:46] RECOVERY - puppet last run on dumpsdata1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:45:46] RECOVERY - puppet last run on logstash1008 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:46:00] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:47:14] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, 
last run 2 minutes ago with 0 failures [12:47:30] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:47:48] RECOVERY - puppet last run on roentgenium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:47:50] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:48:30] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:48:30] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:48:44] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:48:44] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:49:18] RECOVERY - puppet last run on kubetcd2002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:49:34] RECOVERY - puppet last run on cloudservices1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:49:40] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:49:52] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:50:00] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:50:36] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:50:38] RECOVERY - puppet last run on restbase-dev1005 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:52:30] RECOVERY - puppet last run on auth1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
[12:52:50] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:53:48] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:54:38] RECOVERY - puppet last run on dumpsdata1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:54:46] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:55:16] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:55:32] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:56:03] (03CR) 10Muehlenhoff: [C: 03+1] Add apt pinning for buster [puppet] - 10https://gerrit.wikimedia.org/r/486464 (owner: 10Jbond) [12:56:28] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:56:28] RECOVERY - puppet last run on mwlog1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:57:50] RECOVERY - puppet last run on tureis is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:01:44] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [13:07:09] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) I find it **really** confusing that we are reusing numbering for these servers, even with the renaming for the new naming scheme. 
[13:08:09] 10Operations, 10puppet-compiler, 10Continuous-Integration-Config: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10hashar) [13:12:44] 10Operations, 10puppet-compiler, 10Continuous-Integration-Config: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10hashar) The CI job just shells out to the #puppet-compiler: ` lang=sh,name=jjb/operations-puppet-catalog-compiler.yml... [13:18:43] 10Operations, 10puppet-compiler, 10Continuous-Integration-Config: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10hashar) Found it. The command fails when more than half of nodes failed. In operations/software/puppet-compiler: ` lang=... [13:19:09] (03CR) 10MarcoAurelio: "Sorry but I decide what to subscribe to. Thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [13:29:40] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10MoritzMuehlenhoff) Is anyone still using Servermon at this point? [13:52:21] (03PS1) 10Daimona Eaytoy: Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039) [13:52:46] (03CR) 10jerkins-bot: [V: 04-1] Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [13:53:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! Haven't tried it out but looks good to me. 
To be on the safe side, I think we should merge after the all hands" [puppet] - 10https://gerrit.wikimedia.org/r/486169 (https://phabricator.wikimedia.org/T214176) (owner: 10Herron) [14:00:05] 10Operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03jbond [14:05:37] (03CR) 10Filippo Giunchedi: prometheus: upgrade to node-exporter 0.17 in backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [14:14:02] 10Operations, 10Mail, 10OTRS: OTRS receiving flood of emails - https://phabricator.wikimedia.org/T214604 (10Krenair) [14:14:39] 10Operations, 10Mail, 10OTRS: OTRS receiving flood of emails - https://phabricator.wikimedia.org/T214604 (10Krenair) Junk is up to 19021 and rising fast. At least they're not going into proper queues now. [14:15:48] (03PS13) 10Cwhite: prometheus: upgrade to node-exporter 0.17 in backports [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [14:17:38] (03CR) 10Cwhite: prometheus: upgrade to node-exporter 0.17 in backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [14:18:34] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [14:21:13] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:21:52] (03CR) 10Muehlenhoff: prometheus: upgrade to node-exporter 0.17 in backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [14:23:53] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) [14:24:43] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) updated to integrate my comments from T178663#3699074 Could use /srv to be shrinked a bit and a ne... [14:30:05] 10Operations, 10DBA, 10Wikidata, 10Wikimedia-production-error: DBQueryErrors when trying to create Wikidata Items - https://phabricator.wikimedia.org/T214644 (10abian) [14:43:12] !log contint1001: stopping zuul-merger for cleanup duties [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:59] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [14:47:03] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:47:45] grr [14:49:00] ACKNOWLEDGEMENT - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger amusso cleanup duty [14:51:59] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove unused statsite::decommission class [puppet] - 10https://gerrit.wikimedia.org/r/486447 (owner: 10Muehlenhoff) [14:53:31] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - 
https://phabricator.wikimedia.org/T200297 (10daniel) > How does it relate to the subset of wikitext used for edit summaries? As I understand you can do simple things in edi... [14:53:35] 10Operations, 10DBA, 10Wikidata, 10Wikimedia-production-error: DBQueryErrors when trying to create Wikidata Items - https://phabricator.wikimedia.org/T214644 (10jcrespo) p:05Triage→03Low Sorry you suffered this issues. I can see a few hundred `Wikibase\UpsertSqlIdGenerator::upsertId` errors at that tim... [14:54:35] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10akosiaris) >>! In T178690#4877021, @jcrespo wrote: >>>! In T178690#4876994, @CDanis wrote: >> Jaime, going to have to guess here;... [14:54:46] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10akosiaris) p:05Triage→03Low [14:57:13] (03CR) 10Anomie: [C: 03+1] "Looks good to deploy. Confirmed that the rights being set for nowiki (other than the useless-there OAuth rights) are the same as those set" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486422 (https://phabricator.wikimedia.org/T214655) (owner: 10MaxSem) [14:57:35] 10Operations, 10DBA, 10Wikidata, 10Wikimedia-production-error: DBQueryErrors when trying to create Wikidata Items - https://phabricator.wikimedia.org/T214644 (10jcrespo) The edit rate not being affected means this was a probably relatively localized problem https://grafana.wikimedia.org/d/000000208/edit-co... [14:57:38] hey anomie how is s1 going? 
[14:57:39] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10CDanis) Ah, forgot to update the task, but at the time @jcrespo and @fgiunchedi and I talked, and Jaime's biggest gripe was that... [14:57:46] (maintenance script) [14:57:59] * anomie looks [14:58:22] or well, in general [14:58:48] we don't have lag anymore, so I thought it may have ended or you stopped it [14:59:20] (03CR) 10Herron: "> Nice! Haven't tried it out but looks good to me. To be on the safe" [puppet] - 10https://gerrit.wikimedia.org/r/486169 (https://phabricator.wikimedia.org/T214176) (owner: 10Herron) [14:59:23] jynus: It's up to about log_id=42747656, out of 96509908. After that it still has to process the log_search table. [15:00:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, these rules are running in beta prometheus and look like they are working, IOW the new metrics also show up under the old names, e.g" [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:01:00] are those the last 2? [15:01:27] I guess the lag then only happens on revision or something else [15:01:28] s4 is up to about log_id=170110192 (of 278508228), and s5 for dewiki is up to about log_id=94904870 (of 114245230). All the other s5 wikis are done. [15:03:10] * anomie is a little surprised that enwiki has fewer log entries than dewiki. [15:03:41] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10jcrespo) @akosiaris we had some chat about details, I don't mind the USE pattern, but a poor graph using USE doesn't mean it is g... 
[15:03:45] every wiki is a bit different [15:04:14] cebwiki, I think, has the largest templatelinks [15:04:25] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [15:06:07] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10akosiaris) >>! In T178690#4909073, @jcrespo wrote: > @akosiaris we had some chat about details, I don't mind the USE pattern, but... [15:07:00] 10Operations, 10Mail, 10OTRS: OTRS receiving flood of emails - https://phabricator.wikimedia.org/T214604 (10akosiaris) Cleaned up some 10k emails from 2 more host with the same pattern as yesterday and blocked them as well. [15:08:50] 10Operations, 10DBA, 10Wikidata, 10Wikimedia-production-error: DBQueryErrors when trying to create Wikidata Items - https://phabricator.wikimedia.org/T214644 (10Addshore) [15:10:26] 10Operations, 10DBA, 10Wikidata, 10Wikimedia-production-error: DBQueryErrors when trying to create Wikidata Items - https://phabricator.wikimedia.org/T214644 (10Addshore) A better dashboard for this one: https://grafana.wikimedia.org/d/000000170/wikidata-edits?refresh=1m&orgId=1&from=1548357720872&to=15483... [15:10:59] 10Operations, 10DBA, 10Wikidata, 10Wikimedia-production-error: DBQueryErrors when trying to create Wikidata Items - https://phabricator.wikimedia.org/T214644 (10jcrespo) Thanks @addshore ! [15:20:33] 10Operations, 10puppet-compiler, 10Continuous-Integration-Config: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10Dzahn) >>! In T214629#4908818, @hashar wrote: > The command fails when more than half of nodes failed. I ran it on 2 n... 
[15:23:10] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) >>! In T207707#4908981, @hashar wrote: > Could use /srv to be shrinked a bit and a new partition for... [15:24:18] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) Let's ask dcops instead and request a new disk to be added. ? [15:26:19] 10Operations, 10puppet-compiler, 10Continuous-Integration-Config: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10hashar) I don't quite know what it is doing. There are two nodes then the change reports 1 fail and 2 errors: ` [ 2019-0... [15:32:56] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 (10MoritzMuehlenhoff) I think this task is mostly superseded by https://phabricator.wikimedia.org/T212231, https://phabri... [15:38:46] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: conform error reporting levels to HHVM [puppet] - 10https://gerrit.wikimedia.org/r/486485 (https://phabricator.wikimedia.org/T211488) [15:42:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/486485 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [15:43:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: conform error reporting levels to HHVM [puppet] - 10https://gerrit.wikimedia.org/r/486485 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [15:46:43] 10Operations, 10Puppet, 10Operations-Software-Development, 10Patch-For-Review: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133 (10Joe) Yes, we definitely need to make sure that conftool sync is always up to date, and yes, it's ok for it to fail to insert... [15:47:25] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Andrew) >>! In T214448#4908792, @aborrero wrote: > I find it **really** confusing that we are reusing numbering for these servers, even with the r... [15:48:00] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:48:08] 10Operations, 10Puppet, 10Operations-Software-Development, 10Patch-For-Review: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133 (10Joe) Please also note that the confttool-sync phase happens **after** the puppet merge happens, so in an emergency puppet c... 
[15:48:10] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:48:32] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:48:32] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:48:36] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:49:02] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:49:40] 10Operations, 10Puppet, 10Operations-Software-Development, 10Patch-For-Review: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133 (10Joe) Oh, also - the slowness @jynus noticed was due to a bug in etcd that's been solved since. [15:50:55] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "While I think the patch is technically sound, I think this is potentially very dangerous and can lead to drifts between the configuration " [puppet] - 10https://gerrit.wikimedia.org/r/413745 (https://phabricator.wikimedia.org/T157133) (owner: 10Andrew Bogott) [15:51:34] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:52:58] (03PS13) 10Giuseppe Lavagetto: profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) [15:56:46] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [15:58:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::services_proxy: simple local proxying for remote services [puppet] - 10https://gerrit.wikimedia.org/r/483788 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [16:00:59] (03CR) 10Giuseppe Lavagetto: [C: 
03+1] hhvm: Remove support for pre stretch [puppet] - 10https://gerrit.wikimedia.org/r/486449 (owner: 10Muehlenhoff) [16:02:01] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but let's complete the job while we're at it and convert base::service_unit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486443 (owner: 10Muehlenhoff) [16:03:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Warn about lack of changelog or Dockerfile.template [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485880 (owner: 10Hashar) [16:07:24] _joe_: thanks :) [16:08:29] bah merge conflict of doom [16:09:06] (03PS3) 10Hashar: Warn about lack of changelog or Dockerfile.template [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485880 [16:09:58] (03PS1) 10Hashar: Edit Project Config [docker-images/docker-pkg] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/486491 [16:11:19] (03PS4) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [16:15:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Good idea but it needs some more work, see the comments inline. Specifically, I'd like not to rely on shared data for threading, it'a a ba" (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 (owner: 10Hashar) [16:15:06] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) >>! In T214448#4909170, @Andrew wrote: >>>! In T214448#4908792, @aborrero wrote: >> I find it **really** confusing that we are reusing n... [16:15:43] <_joe_> hashar: what did you change in the configuration of the repo? 
[16:19:36] PROBLEM - IPMI Sensor Status on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:20:48] 10Operations, 10ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T214663 (10Papaul) a:05Papaul→03Marostegui @Marostegui disk replacement complete [16:21:31] _joe_: "allow content merge" [16:21:48] <_joe_> uhm please no :) [16:22:05] _joe_: so that if a patch alters a file that meanwhile got touched in the branch, Gerrit would still be able to merge it (and use a git merge to handle the conflict resolution) [16:22:11] with tests covering us ;=] [16:22:24] (03PS14) 10Cwhite: prometheus: upgrade to node-exporter 0.17 in backports [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [16:22:43] should be fine, at least all mediawiki repos have been set like that for ages [16:23:10] (by the way zuul doesn't quite support AllowContentMerge: False, which is https://phabricator.wikimedia.org/T210442 :/// ) [16:23:42] _joe_: I have rebased the change anyway. Feel free to set the settings back to inherited (and thus false). 
Sorry I gotta rush out of the coworking place [16:24:01] <_joe_> yeah I'm going to stop working and start packing shortly [16:24:59] (03PS1) 10Cwhite: aptrepo: add prometheus-node-exporter component for jessie [puppet] - 10https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) [16:25:46] ditto ;( [16:29:23] (03PS15) 10Cwhite: prometheus: upgrade to node-exporter 0.17 in backports [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [16:32:02] (03CR) 10Dzahn: "@KartikMistry there are plans to use a new define for mediawiki periodic jobs at https://phabricator.wikimedia.org/T211250 maybe it is ca" [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [16:39:40] 10Operations, 10monitoring: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) I did a `cumin` run across the whole fleet to find hosts that have memory errors in their `dmesg` buffers, but haven't incremented counters beyond 0. `cdanis@cumin1001.eqiad.wmne... [16:39:48] 10Operations, 10Puppet, 10Operations-Software-Development, 10Patch-For-Review: Consider adding a --skip-conftool option to puppet-merge - https://phabricator.wikimedia.org/T157133 (10jcrespo) Thanks! [16:41:40] (03Abandoned) 10CRusnov: Fix mismatched artifacts (matches frozen_requirements.txt now). [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/486187 (owner: 10CRusnov) [16:47:22] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:47:34] PROBLEM - puppet last run on dumpsdata1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:49:18] RECOVERY - Device not healthy -SMART- on db2068 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2068&var-datasource=codfw+prometheus/ops [16:50:37] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) @Andrew for all those new servers I am using for partman labvirt-ssd.cfg? [16:52:34] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 410 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:53:01] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) @Andrew can you also specify on this task in which VLAN eth1 needs to be for cloudvirt200[1-3]. 
Thanks [17:00:21] (03PS5) 10DCausse: [WIP] Upgrade to 6.5.4 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) [17:00:23] (03PS2) 10DCausse: [WIP] Add nori korean analyzer [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/486266 (https://phabricator.wikimedia.org/T206874) [17:04:56] PROBLEM - Long running screen/tmux on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:09:21] (03CR) 10DCausse: [WIP] Upgrade to 6.5.4 (032 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/446869 (https://phabricator.wikimedia.org/T199791) (owner: 10DCausse) [17:13:42] RECOVERY - puppet last run on dumpsdata1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:15:40] RECOVERY - Disk space on notebook1003 is OK: DISK OK [17:15:40] RECOVERY - DPKG on notebook1003 is OK: All packages OK [17:15:42] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [17:16:10] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:16:22] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [17:16:34] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [17:17:02] !log notebook1003 restarted nagios-nrpe-server due to oom - T212824 [17:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:05] T212824: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 [17:20:02] RECOVERY - IPMI Sensor Status on notebook1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [17:22:49] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) I looked into that and imported all the cables related to scs-eqsin to test the new feature: * The [[ 
https://netbox.wikimedia.org/dcim/cables/ | cable list ]] doesn't allow to filter/sort by endpoint. Which makes... [17:23:31] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [17:26:41] (03CR) 10Bstorm: "We can always just deploy the stretch package and remove it if it turns out bad, right?" (031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [17:27:28] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Fri 2019-01-25 17:27:26 UTC. [17:29:35] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) >>! In T214448#4909362, @Papaul wrote: > @Andrew can you also specify on this task in which VLAN eth1 needs to be for cloudvirt200[1-3]... [17:41:46] (03CR) 10Volans: [C: 03+1] "LGTM, also cumin 'P{C:keyholder} and P{F:lsbdistcodename = trusty}' returns no matching hosts. I didn't checked WMCS though." [puppet] - 10https://gerrit.wikimedia.org/r/486446 (owner: 10Muehlenhoff) [17:43:22] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10Volans) @ayounsi you can test it on the WMCS instance that @crusnov has created to test the upgrade ;) (do not add sensitive data there) [17:47:09] (03CR) 10BryanDavis: "> It's Friday, though. 
Maybe merge and deploy...uhhh, in a" (031 comment) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [17:53:49] (03CR) 10Gergő Tisza: [C: 03+1] Merge the "extended-uploader" and "autopatrolled" user groups on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485487 (https://phabricator.wikimedia.org/T214003) (owner: 10Zoranzoki21) [17:58:02] (03PS1) 10Papaul: DNS: Add production DNS enties for cloudcontrol2001-dev and cloudvirt200[123]-dev [dns] - 10https://gerrit.wikimedia.org/r/486504 (https://phabricator.wikimedia.org/T214448) [17:58:32] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Andrew) >>! In T214448#4909344, @Papaul wrote: > @Andrew for all those new servers I am using for partman labvirt-ssd.cfg? It depends on what rai... [17:59:14] (03CR) 10Samwilson: [C: 03+1] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486405 (https://phabricator.wikimedia.org/T213003) (owner: 10MaxSem) [18:06:19] 10Operations, 10ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T214663 (10jcrespo) Thanks, rebuilding: ` /usr/local/lib/nagios/plugins/get-raid-status-hpssacli ... physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, Rebuilding) ` [18:06:34] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) @Andrew there is no raid controller on the new servers. 
They all have 2x200GB SSD's [18:10:18] (03PS2) 10Mforns: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) [18:15:54] RECOVERY - HP RAID on db2068 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [18:27:50] 10Operations, 10ops-codfw: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T214663 (10Marostegui) 05Open→03Resolved Thanks! ` 18:15 <+icinga-wm> RECOVERY - HP RAID on db2068 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12... [18:28:43] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@94b76f5]: Update mobileapps to 4c42e3d (T214714) [18:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:46] T214714: Improper CSS being served to apps. - https://phabricator.wikimedia.org/T214714 [18:29:51] (03PS1) 10CDanis: Add an Icinga alert for syslog-reported EDAC events, as a workaround for whatever is causing T214529. [puppet] - 10https://gerrit.wikimedia.org/r/486507 (https://phabricator.wikimedia.org/T214529) [18:31:04] (03CR) 10jerkins-bot: [V: 04-1] Add an Icinga alert for syslog-reported EDAC events, as a workaround for whatever is causing T214529. 
[puppet] - 10https://gerrit.wikimedia.org/r/486507 (https://phabricator.wikimedia.org/T214529) (owner: 10CDanis) [18:31:48] (03PS2) 10CDanis: icinga alert for syslogged EDACs [puppet] - 10https://gerrit.wikimedia.org/r/486507 (https://phabricator.wikimedia.org/T214529) [18:32:16] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@94b76f5]: Update mobileapps to 4c42e3d (T214714) (duration: 03m 33s) [18:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:33] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486508 [18:51:35] (03PS1) 10Zoranzoki21: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) [18:51:57] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486508 (owner: 10Zoranzoki21) [18:54:28] (03CR) 10Zoranzoki21: [C: 04-1] "DNM, should be done and for other Serbian projects." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [18:54:34] (03CR) 10Volans: [C: 03+1] "LGTM, let's try not to forget to remove it if we fix the underlying issue ;)" [puppet] - 10https://gerrit.wikimedia.org/r/486507 (https://phabricator.wikimedia.org/T214529) (owner: 10CDanis) [18:55:33] (03PS2) 10Zoranzoki21: Set wgRestrictionLevels for srwiki to autoconfirmed, autopatrol and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 [18:56:34] (03CR) 10CDanis: [C: 03+2] icinga alert for syslogged EDACs [puppet] - 10https://gerrit.wikimedia.org/r/486507 (https://phabricator.wikimedia.org/T214529) (owner: 10CDanis) [18:57:16] (03PS3) 10Zoranzoki21: Set wgRestrictionLevels for srwiki to autoconfirmed, autopatrol and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 [18:57:58] (03PS4) 10Zoranzoki21: Set wgRestrictionLevels for srwiki to autoconfirmed, autopatrol and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 [18:58:23] (03PS5) 10Zoranzoki21: Set wgRestrictionLevels for all Serbian projects to autoconfirmed, autopatrol and sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 [18:59:42] (03CR) 10Zppix: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [19:09:45] PROBLEM - Host db1114 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:17] wut? 
[19:11:43] * volans checking [19:13:32] unable to ping, ssh, attached now to console [19:16:30] mmmm [19:17:13] that is a production enwiki api [19:17:19] depooling [19:17:23] (03PS1) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 [19:18:18] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Julia.glen) My developer account user name is julia.glen Ty [19:18:20] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (owner: 10Jbond) [19:18:37] (03PS1) 10Jcrespo: mariadb: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486514 [19:19:05] (03PS1) 10Volans: depooling db1114 that crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486515 [19:19:22] I made mine [19:19:26] jynus: ah, sorry didn't see your message, was doing the same :) [19:19:27] ^volans [19:19:44] but had to fight with git and re-clone from scratch [19:19:46] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486514 (owner: 10Jcrespo) [19:20:04] (03CR) 10jenkins-bot: mariadb: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486514 (owner: 10Jcrespo) [19:20:26] (03PS2) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:20:31] (03Abandoned) 10Volans: depooling db1114 that crashed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486515 (owner: 10Volans) [19:20:46] we start well [19:21:10] I don't see anything in the console, but I can leave it to you if you prefer [19:21:18] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 00m 46s) [19:21:18] !log disabling notifications on db1114 [19:21:19] (03CR) 10jerkins-bot: [V: 04-1] Create module 
for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [19:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:04] (03PS1) 10Mathew.onipe: wdqs: make free allocator check unique [puppet] - 10https://gerrit.wikimedia.org/r/486517 [19:22:32] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Dzahn) >>! In T214623#4906897, @TJones wrote: > I can't check, other than to ask legal, but I don't actually know who to... [19:23:05] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for fixing this!" [puppet] - 10https://gerrit.wikimedia.org/r/486517 (owner: 10Mathew.onipe) [19:25:27] (03PS3) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:25:37] 10Operations: wmf-auto-reimage-host: icinga downtime error - https://phabricator.wikimedia.org/T214314 (10Dzahn) >>! In T214314#4897014, @aborrero wrote: > duplicate the FQDNs briefly in the DNS while running the reimage+rename. I think it would be better if we can tell wmf-auto-reimage-host / cumin to not even... [19:25:53] Internal error has occurred check for additional logs. [19:25:58] The Intel Management Engine has reported an internal system error. [19:26:04] System CPU Resetting. [19:26:18] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [19:26:59] The Intel Management Engine is unable to utilize the PECI over DMI facility. 
[19:30:01] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10jcrespo) [19:30:41] "We have a Dell PowerEdge R630 Server with Intel Xeon E5-2640 v4 CPUs and we are getting exactly the same error after updating from BIOS 2.3.4 to 2.6.0:Sometimes (it can be 2-3 weeks uptime) we get a intel management engine error followed by a hardreset." [19:31:16] "We have a systemic issue we have experienced this over 100 times. Dell is engaged but we have not made much process. We were told to move from 2.4.3 to 2.6.0 by Dell, the condition still persisted. " :/ [19:32:07] "It's very unlikely that this a hardware issue since this situation has started after the BIOS update" [19:32:33] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1068&service=EDAC+syslog+messages expected db1068 to fail for this [19:34:12] "Dell Poweredge R630 massive stability problems" HW type:Dell PowerEdge R630 :/ [19:36:04] !log powercycle db1114 T214720 [19:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:07] T214720: db1114 crashed - https://phabricator.wikimedia.org/T214720 [19:39:03] is anyone able to see where my error is on the following report? I can't spot it and it's not triggering locally [19:39:06] https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/5072/console [19:39:14] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10Dzahn) [[ https://forums.intel.com/s/question/0D50P0000490X2QSAU/the-intel-management-engine-is-unable-to-utilize-the-peci-over-dmi-facility?language=en_US | Intel - "The Intel Management Engine is unable to utili... 
[19:40:30] PROBLEM - EDAC syslog messages on db1068 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops [19:41:07] well that's just overly picky [19:43:24] jbond42: where you include base::firewall, should be ::base::firewall ? [19:44:17] the only one in the actual output i notice and not sure if normal is " Failed to retrieve Augeas version: cannot load such file -- augeas" [19:44:30] oh yes i forgot scoping is all different in 4.*, will update that and try [19:44:31] and augeas says to me firewall related [19:44:46] mutante: yes i saw that and also unsure if it's normal, doubt i managed to break that one [19:47:27] (03PS4) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:47:44] jynus: one person says "exactly the same error _after_ updating from BIOS 2.3.4 to 2.6.0" and the other guy says "told to move from 2.4.3 to 2.6.0 by Dell," what a "nice" combo :/ [19:48:00] but that totally matches.. same model [19:48:24] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [19:51:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486525 [19:52:54] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10jcrespo) While a CPU failure should be "clean", with gtid and binlog_sync, it should be checked or reimaged before being repooled. [19:52:56] elukey: thinking of https://phabricator.wikimedia.org/T203786, we really need to deal with the failover problem (e.g. 
adding gutter nodes or something) [19:53:24] (03PS1) 10Volans: dns: remove unused dry_run argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/486527 [19:53:26] (03PS1) 10Volans: ipmi: fix typos in docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/486528 [19:53:28] (03PS1) 10Volans: management: add management module [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 [19:53:30] (03PS1) 10Volans: icinga: add context manager for downtimed hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 [19:53:46] (03CR) 10jerkins-bot: [V: 04-1] icinga: add context manager for downtimed hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [19:56:52] (03PS1) 10Jcrespo: mariadb: Pool db1106 as an extra api host after db1114 crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486533 (https://phabricator.wikimedia.org/T214720) [19:57:37] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [19:59:49] (03PS5) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [20:00:53] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [20:00:55] (03CR) 10Jcrespo: [C: 03+2] mariadb: Pool db1106 as an extra api host after db1114 crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486533 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [20:01:53] (03Merged) 10jenkins-bot: mariadb: Pool db1106 as an extra api host after db1114 crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486533 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [20:04:42] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1106 as an extra api host (duration: 00m 46s) [20:04:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:47] (03PS6) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [20:10:47] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [20:12:25] (03CR) 10jenkins-bot: mariadb: Pool db1106 as an extra api host after db1114 crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486533 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [20:16:22] (03PS1) 10Zoranzoki21: Removed namespace Comment, added namespace Portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) [20:16:24] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Andrew) >>! In T214448#4909558, @Papaul wrote: > @Andrew there is no raid controller on the new servers. They all have 2x200GB SSD's ok -- let's... 
[20:16:43] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 28.01 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [20:17:11] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [20:17:41] (03PS2) 10Zoranzoki21: Removed namespace Коментар, added namespace Портал on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) [20:19:33] (03CR) 10Gehel: [C: 03+2] wdqs: make free allocator check unique [puppet] - 10https://gerrit.wikimedia.org/r/486517 (owner: 10Mathew.onipe) [20:20:58] (03PS1) 10Zoranzoki21: Changed wgImportSources for srwikinews to w:sr instead of no which is unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486538 (https://phabricator.wikimedia.org/T214562) [20:21:34] (03CR) 10Gehel: [C: 03+2] ipmi: fix typos in docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/486528 (owner: 10Volans) [20:22:42] (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/486527 (owner: 10Volans) [20:26:07] (03PS1) 10BryanDavis: tcl: switch base image from jessie to stretch [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486539 (https://phabricator.wikimedia.org/T214668) [20:30:40] (03CR) 10GTirloni: "Also need to edit tcl/base/Dockerfile.template to depend on Stretch" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486539 (https://phabricator.wikimedia.org/T214668) (owner: 10BryanDavis) [20:34:06] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) a:03ayounsi [20:34:29] (03CR) 10BryanDavis: "> Also need to edit tcl/base/Dockerfile.template to depend on Stretch" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486539 
(https://phabricator.wikimedia.org/T214668) (owner: 10BryanDavis) [20:34:31] (03PS2) 10BryanDavis: tcl: switch base image from jessie to stretch [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486539 (https://phabricator.wikimedia.org/T214668) [20:37:30] (03CR) 10Andrew Bogott: "Giuseppe, is that an argument in favor of removing the --skip-confctl arg but keeping the check in this patch so that the puppet-merge err" [puppet] - 10https://gerrit.wikimedia.org/r/413745 (https://phabricator.wikimedia.org/T157133) (owner: 10Andrew Bogott) [20:37:42] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for mw2213.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downti... [20:40:01] (03CR) 10Gehel: [C: 04-1] "see comments inline" (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (owner: 10Volans) [20:40:29] (03CR) 10BryanDavis: [C: 03+2] tcl: switch base image from jessie to stretch [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486539 (https://phabricator.wikimedia.org/T214668) (owner: 10BryanDavis) [20:40:52] (03Merged) 10jenkins-bot: tcl: switch base image from jessie to stretch [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486539 (https://phabricator.wikimedia.org/T214668) (owner: 10BryanDavis) [20:41:09] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) [20:42:21] (03CR) 10Gehel: [C: 03+1] "I was soo looking forward to that patch!" 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [20:44:05] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10RobH) [20:44:56] 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10RobH) a:03RobH [20:46:49] (03PS1) 10RobH: mw2213 decom production dns entries [dns] - 10https://gerrit.wikimedia.org/r/486545 (https://phabricator.wikimedia.org/T203434) [20:47:27] (03PS1) 10BryanDavis: tcl/web: Create /var/run/lighttpd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486546 (https://phabricator.wikimedia.org/T214668) [20:47:43] (03CR) 10RobH: [C: 03+2] mw2213 decom production dns entries [dns] - 10https://gerrit.wikimedia.org/r/486545 (https://phabricator.wikimedia.org/T203434) (owner: 10RobH) [20:47:50] (03CR) 10BryanDavis: [C: 03+2] tcl/web: Create /var/run/lighttpd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486546 (https://phabricator.wikimedia.org/T214668) (owner: 10BryanDavis) [20:48:12] (03Merged) 10jenkins-bot: tcl/web: Create /var/run/lighttpd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/486546 (https://phabricator.wikimedia.org/T214668) (owner: 10BryanDavis) [20:48:44] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10RobH) [20:49:06] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10RobH) [20:50:27] (03PS2) 10Volans: dns: remove unused dry_run argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/486527 [20:50:45] (03PS1) 10RobH: mw2213 decom [puppet] - 10https://gerrit.wikimedia.org/r/486547 (https://phabricator.wikimedia.org/T203434) [20:54:21] 10Operations: CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10jbond) [20:55:31] RECOVERY - 
puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [20:56:20] 10Operations: CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10jbond) [20:58:54] (03CR) 10RobH: [C: 03+2] mw2213 decom [puppet] - 10https://gerrit.wikimedia.org/r/486547 (https://phabricator.wikimedia.org/T203434) (owner: 10RobH) [20:59:02] (03PS1) 10Krinkle: flaggedrevs: Remove unused variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486550 [20:59:46] 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10RobH) [20:59:59] 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10RobH) a:05RobH→03Papaul [21:00:57] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching), and 2 others: Reconfigure hardware and reimage restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T210863 (10RobH) [21:01:02] (03PS3) 10Krinkle: PhpAutoPrepend: Merge php7.php into PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486176 [21:05:07] RECOVERY - Long running screen/tmux on notebook1003 is OK: OK: No SCREEN or tmux processes detected. [21:05:50] !log cleared sel on db1068, it had a power redundancy loss event (old and resolved) that was triggering the icinga check [21:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:05] PROBLEM - puppet last run on an-worker1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:13:56] 10Operations, 10HHVM: Fix config file handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306 (10Krinkle) [21:14:37] RECOVERY - ElasticSearch shard size check on search.svc.eqiad.wmnet is OK: OK - All good! 
[21:14:42] 10Operations, 10PHP 7.0 support, 10Patch-For-Review: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) Found this old issue - T157306, should that be closed, or still something we need/want for HHVM during the migration? [21:15:02] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10EBjune) The official contract end date in our system is 6/30/2019, if that helps. [21:17:07] (03CR) 10Volans: [C: 03+2] dns: remove unused dry_run argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/486527 (owner: 10Volans) [21:17:44] 10Operations, 10PHP 7.2 support: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (10Krinkle) [21:18:03] 10Operations, 10PHP 7.2 support: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (10Krinkle) [21:22:30] (03PS1) 10Aaron Schulz: Make labs just use the "mcrouter" object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486558 (https://phabricator.wikimedia.org/T214275) [21:22:36] (03Merged) 10jenkins-bot: dns: remove unused dry_run argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/486527 (owner: 10Volans) [21:23:35] (03CR) 10jenkins-bot: dns: remove unused dry_run argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/486527 (owner: 10Volans) [21:24:21] (03PS2) 10Volans: ipmi: fix typos in docstrings [software/spicerack] - 10https://gerrit.wikimedia.org/r/486528 [21:30:07] 10Operations, 10PHP 7.2 support: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (10Reedy) Why's it apparently running PHP 7.0? Shouldn't it be PHP 7.2? 
[21:33:16] (CR) Aaron Schulz: [C: +1] flaggedrevs: Remove unused variables [mediawiki-config] - https://gerrit.wikimedia.org/r/486550 (owner: Krinkle) [21:33:29] (CR) jenkins-bot: ipmi: fix typos in docstrings [software/spicerack] - https://gerrit.wikimedia.org/r/486528 (owner: Volans) [21:33:41] Operations, PHP 7.2 support: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (Jdforrester-WMF) Special:Version reports 7.2 (specifically `7.2.8-1+0~20180725124257.2+stretch~1.gbp571e56 (fpm-fcgi)`) on all the debug servers – mwdebug1001,... [21:35:49] Operations, SRE-Access-Requests, Discovery-Search (Current work): Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (Dzahn) >>! In T214623#4910053, @EBjune wrote: > The official contract end date in our system is 6/30/2019, if that helps... [21:38:25] Operations, PHP 7.2 support: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle) [21:38:46] Operations, PHP 7.2 support: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (Reedy) They do have PHP 7.0 installed [21:38:53] (CR) Krinkle: [C: +2] PhpAutoPrepend: Merge php7.php into PhpAutoPrepend.php [mediawiki-config] - https://gerrit.wikimedia.org/r/486176 (owner: Krinkle) [21:39:17] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (Dzahn) If Wikimedianz-owner@NPSPAMWikipedia.org is indeed not answering any mails then yea, we can do it. No need to create a new list. Maybe a second neutral p...
[21:39:55] (Merged) jenkins-bot: PhpAutoPrepend: Merge php7.php into PhpAutoPrepend.php [mediawiki-config] - https://gerrit.wikimedia.org/r/486176 (owner: Krinkle) [21:40:10] RECOVERY - puppet last run on an-worker1085 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:40:53] * Krinkle staging on mwdebug1002 [21:42:49] (PS2) Krinkle: flaggedrevs: Remove unused variables [mediawiki-config] - https://gerrit.wikimedia.org/r/486550 [21:43:16] !log krinkle@deploy1001 Synchronized wmf-config/PhpAutoPrepend.php: Idb695dd033d42 (duration: 00m 47s) [21:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:13] !log krinkle@deploy1001 Synchronized wmf-config/: Idb695dd033d42 (duration: 00m 46s) [21:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:01] (CR) Krinkle: [C: +2] "To confirm, this is labs (beta cluster), not labswiki (wikitech). So nutcracker is installed there already, and being used via the mirror " [mediawiki-config] - https://gerrit.wikimedia.org/r/486558 (https://phabricator.wikimedia.org/T214275) (owner: Aaron Schulz) [21:45:25] Operations, Wikimedia-Mailing-lists: Reset list admin password for Wikies-l mailing list - https://phabricator.wikimedia.org/T214249 (Dzahn) https://wikitech.wikimedia.org/wiki/Mailman#Reset_the_admin_password_of_a_list ` [fermium:~] $ list_name="wikies-l" ; sudo /var/lib/mailman/bin/change_pw -l $list...
[21:45:47] Operations, Wikimedia-Mailing-lists: Reset list admin password for Wikies-l mailing list - https://phabricator.wikimedia.org/T214249 (Dzahn) Open→Resolved a:Dzahn [21:46:05] (Merged) jenkins-bot: Make labs just use the "mcrouter" object cache [mediawiki-config] - https://gerrit.wikimedia.org/r/486558 (https://phabricator.wikimedia.org/T214275) (owner: Aaron Schulz) [21:48:29] (CR) Krinkle: [C: +2] flaggedrevs: Remove unused variables [mediawiki-config] - https://gerrit.wikimedia.org/r/486550 (owner: Krinkle) [21:49:33] (Merged) jenkins-bot: flaggedrevs: Remove unused variables [mediawiki-config] - https://gerrit.wikimedia.org/r/486550 (owner: Krinkle) [21:50:19] * Krinkle staging on mwdebug1002 [21:54:28] Operations, ops-eqiad: Broken memory on thumbor1004 - https://phabricator.wikimedia.org/T207721 (RobH) This expired back in 2017, shouldn't we just replace it rather than repair? [21:54:39] (CR) jenkins-bot: PhpAutoPrepend: Merge php7.php into PhpAutoPrepend.php [mediawiki-config] - https://gerrit.wikimedia.org/r/486176 (owner: Krinkle) [21:54:41] (CR) jenkins-bot: Make labs just use the "mcrouter" object cache [mediawiki-config] - https://gerrit.wikimedia.org/r/486558 (https://phabricator.wikimedia.org/T214275) (owner: Aaron Schulz) [21:54:43] (CR) jenkins-bot: flaggedrevs: Remove unused variables [mediawiki-config] - https://gerrit.wikimedia.org/r/486550 (owner: Krinkle) [21:56:23] !log krinkle@deploy1001 Synchronized wmf-config/flaggedrevs.php: I95c37d628557c (duration: 00m 46s) [21:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:20] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (Podzemnik) Sure. Who is a neutral person though? User:John Vandenberg tried, I tried it today. Should I also ask somebody else?
[22:11:55] Operations, MediaWiki-Cache, Performance-Team, Patch-For-Review, User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (aaron) Things needed here: [] Use only mcrouter in deployment-prep (no multiwrite) from MW [] Remove... [22:16:17] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (Dzahn) >>! In T214271#4910179, @Podzemnik wrote: > Sure. Who is a neutral person though? Nominates @MarcoAurelio ;) > User:John Vandenberg tried, I tried it... [22:22:35] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (Dzahn) >>! In T214271#4910230, @Dzahn wrote: > Did discussion about it take place on a wiki by any chance? A link to something like that would help a lot as wel... [22:26:31] Operations, MediaWiki-Debug-Logger, Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (Krinkle) Tried both via `forceprofile=1` (GET parameter) and via the `profiler` (XWD attribute), and both never seems to work yet on PHP 7. They do still work on HHVM. [22:27:49] Operations, PHP 7.2 support: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (Joe) The reason for this error is php7.0 is still installed on the mwdebug servers, I should clean it up. Once in a while, a cronjob for php7.0 runs and genera... [22:28:13] Operations, serviceops: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (Joe) p:Triage→Low [22:34:49] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (MarcoAurelio) >>! 
In T214271#4910230, @Dzahn wrote: > Nominates @MarcoAurelio ;) The list admin appears to be former user `User:Brian New Zealand` currently ht... [22:42:25] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@5e859c4]: Update mobileapps to a8834e8 (T214728) [22:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:29] T214728: Dark mode broken (page background remains light) in apps. - https://phabricator.wikimedia.org/T214728 [22:42:37] Operations, ops-eqiad: Broken memory on thumbor1004 - https://phabricator.wikimedia.org/T207721 (Cmjohnson) Typically that has been the standard response. [22:45:52] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@5e859c4]: Update mobileapps to a8834e8 (T214728) (duration: 03m 27s) [22:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:14] Operations, serviceops, Wikimedia-production-error: PHP Warning: PHP Startup: Unable to load dynamic library luasandbox.so - https://phabricator.wikimedia.org/T214730 (Krinkle) [22:50:56] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (Dzahn) also: http://wikinewsie.org/ looks pretty inactive as well and i think that would be his site or the former proposed chapter site? [22:55:36] Operations, Wikimedia-Mailing-lists: Adding administrator to mailing list for Wikimedia New Zealand - https://phabricator.wikimedia.org/T214271 (MarcoAurelio) I'm not sure what wikinewsie is/was, sorry. I think https://nz.wikimedia.org/wiki/Main_Page was the project/chapter wiki. But it is closed. If a...
[22:55:43] (PS3) MaxSem: Set confirmed permissions after extensions are loaded [mediawiki-config] - https://gerrit.wikimedia.org/r/486405 (https://phabricator.wikimedia.org/T213003) [22:55:48] Operations, Parsoid, Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (ssastry) [23:03:41] (PS1) Papaul: Add DHCP MAC addrese and partman for cloudcontrol2001-dev and cloudvirt200[123]-dev [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) [23:04:23] (CR) jerkins-bot: [V: -1] Add DHCP MAC addrese and partman for cloudcontrol2001-dev and cloudvirt200[123]-dev [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) (owner: Papaul) [23:09:12] (PS2) Volans: management: add management module [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) [23:09:14] (PS2) Volans: icinga: add context manager for downtimed hosts [software/spicerack] - https://gerrit.wikimedia.org/r/486530 [23:10:10] (CR) Volans: "comment addressed except two, for those see the replies inline" (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans) [23:10:23] (CR) Volans: icinga: add context manager for downtimed hosts (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/486530 (owner: Volans) [23:10:57] (CR) Dzahn: "it downvotes you because of commit message guidelines (over 80 chars). 
i'll shorten it a bit" [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) (owner: Papaul) [23:11:58] (PS1) Dzahn: planet: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/486701 [23:13:10] (PS2) Dzahn: DHCP/partman: add cloudcontrol2001-dev and cloudvirt200[123]-dev [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) (owner: Papaul) [23:15:59] (CR) Dzahn: [C: -1] "|cloudcontrol2001-dev|cloudvirt-dev200[1-3]) is the "-dev" part before or after the number? the commit message and code have it different" [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) (owner: Papaul) [23:18:06] (CR) Dzahn: [C: -1] "the DHCP part looks consistent. then it's just a typo in the netboot.cfg" [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) (owner: Papaul) [23:22:25] (CR) Dzahn: [C: -1] "URL does not include protocol ...hrm Stdlib::HTTPSUrl = Pattern[/^https:\/\//], got 'meta.wikimedia.org/wiki/Planet_Wikimedia'" [puppet] - https://gerrit.wikimedia.org/r/486701 (owner: Dzahn) [23:22:45] (PS3) Papaul: DHCP/partman: add cloudcontrol2001-dev and cloudvirt200[123]-dev [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) [23:28:19] (PS2) Dzahn: planet: add data types and hardcode https link, not proto-relative [puppet] - https://gerrit.wikimedia.org/r/486701 [23:30:00] (CR) Dzahn: [C: +2] DHCP/partman: add cloudcontrol2001-dev and cloudvirt200[123]-dev [puppet] - https://gerrit.wikimedia.org/r/486700 (https://phabricator.wikimedia.org/T214448) (owner: Papaul) [23:34:09] Operations, Continuous-Integration-Config: CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (Peachey88) [23:35:11] Operations, Cloud-Services, Patch-For-Review: rack/setup/install 
cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (Papaul) [23:38:08] (PS3) Dzahn: planet: add data types and hardcode https link, not proto-relative [puppet] - https://gerrit.wikimedia.org/r/486701 [23:38:56] Operations, Analytics, Analytics-Kanban, Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (Nuria) Open→Resolved [23:41:09] Operations, ops-codfw, Traffic, decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (ops-monitoring-bot) wmf-decommission-host was executed by robh for baham.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downtime... [23:42:44] (PS4) Dzahn: planet: add data types and hardcode https link, not proto-relative [puppet] - https://gerrit.wikimedia.org/r/486701 [23:43:22] Operations, ops-codfw, Traffic, decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (RobH) [23:44:44] (PS1) RobH: decom baham [puppet] - https://gerrit.wikimedia.org/r/486703 (https://phabricator.wikimedia.org/T199247) [23:44:57] (CR) Dzahn: [V: +1 C: +1] "compiles now https://puppet-compiler.wmflabs.org/compiler1002/14487/" [puppet] - https://gerrit.wikimedia.org/r/486701 (owner: Dzahn) [23:45:42] (PS1) RobH: decom baham prod dns entries [dns] - https://gerrit.wikimedia.org/r/486704 (https://phabricator.wikimedia.org/T199247) [23:46:04] (CR) RobH: [C: +2] decom baham [puppet] - https://gerrit.wikimedia.org/r/486703 (https://phabricator.wikimedia.org/T199247) (owner: RobH) [23:46:27] (CR) RobH: [C: +2] decom baham prod dns entries [dns] - https://gerrit.wikimedia.org/r/486704 (https://phabricator.wikimedia.org/T199247) (owner: RobH) [23:47:45] Operations, ops-codfw, decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (RobH) [23:48:05] Operations, 
ops-codfw, decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (RobH) a:Papaul ready for onsite wipe and decom steps. [23:48:57] Operations, ops-codfw, decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (RobH) [23:49:48] (CR) CRusnov: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: Volans)