[00:03:21] AaronSchulz: staged on mwdebug1002
[00:15:02] (CR) Thcipriani: [C: +2] Automate deployments [deployment-charts] - https://gerrit.wikimedia.org/r/597653 (https://phabricator.wikimedia.org/T253264) (owner: Jeena Huneidi)
[00:15:37] (Merged) jenkins-bot: Automate deployments [deployment-charts] - https://gerrit.wikimedia.org/r/597653 (https://phabricator.wikimedia.org/T253264) (owner: Jeena Huneidi)
[00:40:04] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (YiJuLu) @Dzahn Hi, I just created one, the username is YiJuLu, Thanks a lot:)
[00:47:53] AaronSchulz: standing by
[00:52:27] aye
[00:54:32] Krinkle: lgtm
[01:01:20] AaronSchulz: ack, rolling out now
[01:03:13] !log krinkle@deploy1001 Synchronized wmf-config/mc.php: I06897bcc92c5 (duration: 00m 59s)
[01:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:15] Operations, WMF-Design, Design, Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (Prtksxna) Thank you for taking this up @Dzahn 🙏🏽 >>! In T254118#6181687, @Dzahn wrote: > Hi @Prtksxna I just cloned https://...
[03:21:02] (PS1) Andrew Bogott: Revert "Designate: have mdns use tcp rather than udp for axfr" [puppet] - https://gerrit.wikimedia.org/r/601550
[03:21:04] (PS1) Andrew Bogott: Rocky/Buster/Designate: a few live hacks to get things working on Buster [puppet] - https://gerrit.wikimedia.org/r/601551 (https://phabricator.wikimedia.org/T253780)
[03:23:21] (CR) Andrew Bogott: [C: +2] Revert "Designate: have mdns use tcp rather than udp for axfr" [puppet] - https://gerrit.wikimedia.org/r/601550 (owner: Andrew Bogott)
[03:26:09] Operations, ORES, Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (Gilles) @dpifke has looked into this. Dave, can you share notes here about what you've tried and what you unde...
[03:29:28] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[03:32:19] (CR) Andrew Bogott: [C: +2] Rocky/Buster/Designate: a few live hacks to get things working on Buster [puppet] - https://gerrit.wikimedia.org/r/601551 (https://phabricator.wikimedia.org/T253780) (owner: Andrew Bogott)
[03:43:45] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles) It's waiting for someone to do it. @jijiki when she gets back from leave, possibly?
[03:53:19] (CR) Gilles: [C: +1] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: Dave Pifke)
[03:59:03] (CR) Gilles: Set expiry headers on thumbnails (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[05:01:02] !log Stop mysql on db1141 to save a binary backup - T249188
[05:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:06] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[05:01:51] RECOVERY - Check systemd state on labstore1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:16] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[05:13:12] ^ expected
[05:13:40] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy
[05:46:32] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 94 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:50:46] (CR) Ammarpad: Use AddFooterLink hook for code of conduct and contact links (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: Jdlrobson)
[05:58:12] Operations, netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi)
[05:58:49] Operations, netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi)
[06:13:33] (CR) Ayounsi: [C: +1] "LGTM based on Ibe5453c71768107dacf306f1107dc61a6e615b09" [dns] - https://gerrit.wikimedia.org/r/601434 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[06:32:14] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:53] Operations, Analytics: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (elukey) There are still some important jobs running so I cannot reboot the instance, will do it hopefully tomorrow :)
[06:44:52] Operations, netops: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (ayounsi) p:Triage→High
[07:02:13] Operations, netops: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (ayounsi) Opened JTAC case 2020-0601-0882, at this point it's too much of a coincidence to not think of a backplane issue.
[07:06:35] !log Stop MySQL and poweroff on db1138 for on-site maintenance - T253808
[07:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:39] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[07:08:25] (CR) DCausse: [C: +1] Remove duplication and improve clarity in role::wdqs [puppet] - https://gerrit.wikimedia.org/r/598884 (owner: EBernhardson)
[07:09:15] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Idle - Telia, AS1299/IPv4: Idle - Telia Ayounsi https://phabricator.wikimedia.org/T254216 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:09:15] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T254216 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:19] Operations, ops-eqiad, DBA, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) @Jclark-ctr db1138 is now off and ready for you to change the memory whenever you get to the DC. Once you are done, please power the host back o...
[07:12:56] (PS6) Kormat: mariadb: Add db2040 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985)
[07:15:51] (CR) Marostegui: [C: -1] "Commit says db2040" [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: Kormat)
[07:16:59] marostegui: thanks for catching that :)
[07:17:06] (PS7) Kormat: mariadb: Add db2140 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985)
[07:19:25] (CR) Marostegui: [C: +1] mariadb: Add db2140 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: Kormat)
[07:19:40] (CR) Kormat: [C: +2] mariadb: Add db2140 to s4 [puppet] - https://gerrit.wikimedia.org/r/599746 (https://phabricator.wikimedia.org/T252985) (owner: Kormat)
[07:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 for data check', diff saved to https://phabricator.wikimedia.org/P11350 and previous config saved to /var/cache/conftool/dbconfig/20200602-072214-marostegui.json
[07:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:49] !log Stop slave on db1079 for data check
[07:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1079 after data check', diff saved to https://phabricator.wikimedia.org/P11351 and previous config saved to /var/cache/conftool/dbconfig/20200602-073245-marostegui.json
[07:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:11] (PS1) Marostegui: mariadb: Place db1148 into s4 [puppet] - https://gerrit.wikimedia.org/r/601641 (https://phabricator.wikimedia.org/T252512)
[07:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 to clone db1148', diff saved to https://phabricator.wikimedia.org/P11353 and previous config saved to /var/cache/conftool/dbconfig/20200602-074027-marostegui.json
[07:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:31] marostegui: oh, heh. it looks like you committed my changes too
[07:41:38] (i was depooling db2110)
[07:41:49] was very confused when `dbctl config diff` was empty
[07:41:53] oh yeah
[07:42:07] that's interesting, the diff was so long that it was cut
[07:42:14] so I only saw my changes
[07:43:17] !log Stop MySQL on db1121
[07:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:13] ACKNOWLEDGEMENT - MariaDB read only s4 on db2110 is CRITICAL: Could not connect to localhost:3306 Kormat Source of copy to db2140 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[07:48:51] (CR) DCausse: [C: +1] query_service: Move shared config into common file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599145 (owner: EBernhardson)
[07:48:55] PROBLEM - mysqld processes on db2110 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:53:58] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2110 is CRITICAL: CRITICAL slave_sql_lag could not connect Kormat Source of copy to db2140 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:53:58] ACKNOWLEDGEMENT - mysqld processes on db2110 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Kormat Source of copy to db2140 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:54:12] that's what i get for trying to do a minimal set of downtimes, sigh.
[07:57:21] (PS1) Dzahn: DHCP: switch contint1001 from jessie to buster installer [puppet] - https://gerrit.wikimedia.org/r/601645 (https://phabricator.wikimedia.org/T224591)
[07:58:09] kormat: why not disabling notifications entirely? if that were eqiad, those would have paged
[07:58:22] (PS6) Alexandros Kosiaris: Initial debianization [debs/kubeyaml] - https://gerrit.wikimedia.org/r/601372
[07:59:28] marostegui: uff, ack.
[08:00:11] kormat: normally if I am going to put mysql down on a host, I downtime the host entirely
[08:00:27] and if it will take a few days I just disable notifications (and downtime)
[08:02:00] ack
[08:02:10] (CR) Dzahn: [C: +2] DHCP: switch contint1001 from jessie to buster installer [puppet] - https://gerrit.wikimedia.org/r/601645 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[08:02:18] (PS2) Dzahn: DHCP: switch contint1001 from jessie to buster installer [puppet] - https://gerrit.wikimedia.org/r/601645 (https://phabricator.wikimedia.org/T224591)
[08:04:03] @bang 9
[08:08:24] that's a lot of banging
[08:08:26] * vgutierrez hides
[08:09:39] ;)
[08:09:45] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Script wmf-auto-reimage wa...
[08:09:48] !log re-imaging contint1001 with buster
[08:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:13] Operations, Traffic: atskafka: expose rdkafka metrics to prometheus - https://phabricator.wikimedia.org/T253551 (ema) Open→Resolved a:ema This is now done: atskafka uses [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/prometheus-rdkafka-exporter | prometheus-rdkafka-expo...
[08:14:07] heh
[08:14:15] I was wondering why I couldn't log into contint1001 all of a sudden
[08:16:04] legoktm: i wasn't aware there are people doing that? what is it for?
[08:16:40] also contint1001 is not the active CI server , btw
[08:16:44] mutante: I didn't realize we had switched to a different server and still had my config pointing to 1001
[08:18:05] legoktm: ok, so you just wanted to see if it's down or there is something that is being run manually on the shell? contint2001 is currently the active one, yes
[08:18:14] I meant to get into 2001
[08:18:19] which I am logged into now :)
[08:18:26] to do what though?
[08:18:35] deploy CI changes
[08:18:50] oh. the phan thing?
[08:19:03] https://gerrit.wikimedia.org/r/600403
[08:19:13] I guess our home dirs didn't get synced from 1001 to 2001
[08:19:34] i see
[08:20:02] is this the normal workflow to deploy CI changes every time?
[08:21:06] for changes to zuul/*, yes. you ssh in, git pull the repo, and then reload the `zuul` service. it's automated by https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/master/fabfile.py
[08:21:52] there's a task somewhere about changing it to use scap, but that's been pending for years
[08:22:20] sigh.. i would like it to be neither of these options, tbh
[08:24:58] puppet reloads the zuul service when the config changes
[08:25:21] but not "restart".. so not sure
[08:25:48] maybe comment on https://phabricator.wikimedia.org/T129357?
[08:28:51] done
[08:29:52] ty :)
[08:30:03] (CR) Kormat: [C: +1] mariadb: Place db1148 into s4 [puppet] - https://gerrit.wikimedia.org/r/601641 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui)
[08:30:19] (CR) Marostegui: [C: +2] mariadb: Place db1148 into s4 [puppet] - https://gerrit.wikimedia.org/r/601641 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui)
[08:32:08] Operations, ops-eqiad, DC-Ops: (Need By: ASAP) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (fgiunchedi)
[08:33:06] Operations, ops-eqiad, DC-Ops: (Need By: ASAP) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (fgiunchedi) I just noticed three out of four hosts are in row C, we'll need one host per row though
[08:34:59] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[08:35:44] Operations, ops-eqiad, DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (fgiunchedi) Similarly to thanos-be, these hosts will need to be row-diverse but ATM there are two in row A
[08:36:17] (CR) ArielGlenn: "Do we have alerts for the specific failures for services covered by systemd units that make sense for WMCS to handle, if this is turned of" [puppet] - https://gerrit.wikimedia.org/r/601374 (owner: Bstorm)
[08:37:00] (PS1) Ema: 0.15: build against prometheus-rdkafka-exporter 0.2 [software/purged] - https://gerrit.wikimedia.org/r/601649
[08:38:31] (CR) Dzahn: ""Monitoring systemd as a whole will always page WMCS because the host" [puppet] - https://gerrit.wikimedia.org/r/601374 (owner: Bstorm)
[08:38:49] (CR) Filippo Giunchedi: [C: +2] swift: refactor stats_reporter into a profile [puppet] - https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[08:38:58] (PS3) Filippo Giunchedi: swift: refactor stats_reporter into a profile [puppet] - https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186)
[08:48:03] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 93 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:48:15] (CR) Filippo Giunchedi: "Do we gain anything significant from fsnotify in the normal case? In other words IIRC mtail will track inodes when fsnotify is disabled (?" [puppet] - https://gerrit.wikimedia.org/r/601436 (https://phabricator.wikimedia.org/T254192) (owner: Cwhite)
[08:51:21] (CR) Ema: [C: +2] 0.15: build against prometheus-rdkafka-exporter 0.2 [software/purged] - https://gerrit.wikimedia.org/r/601649 (owner: Ema)
[08:59:33] (CR) Filippo Giunchedi: [C: -1] "Thanks for working on this! See inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: Dave Pifke)
[08:59:41] !log upload purged 0.15 to buster-wikimedia
[08:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:30] (PS2) Filippo Giunchedi: prometheus: enable Thanos upload for analytics [puppet] - https://gerrit.wikimedia.org/r/601326 (https://phabricator.wikimedia.org/T252186)
[09:05:31] (PS1) Filippo Giunchedi: thanos: enable swift stats reporting [puppet] - https://gerrit.wikimedia.org/r/601657 (https://phabricator.wikimedia.org/T252186)
[09:07:17] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/22923/" [puppet] - https://gerrit.wikimedia.org/r/601657 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[09:09:58] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Completed auto-reimage of...
[09:13:55] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/551882 (owner: Dzahn)
[09:17:29] the reimaging of cont1001 failed ... because ...
[09:17:33] the partman recipe it uses has been deleted.
[09:20:47] Operations, ORES, Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (akosiaris)
[09:20:56] Operations, ORES, Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (akosiaris) p:Triage→High
[09:22:51] (PS1) Alexandros Kosiaris: Switch oresrdb.svc records to redis::misc [dns] - https://gerrit.wikimedia.org/r/601665 (https://phabricator.wikimedia.org/T254226)
[09:23:05] (PS1) Dzahn: partman: switch contint1001 to raid10-4dev, previous recipe is gone [puppet] - https://gerrit.wikimedia.org/r/601666 (https://phabricator.wikimedia.org/T224591)
[09:23:29] (CR) Alexandros Kosiaris: "@halfak, thanks. I 've filed https://phabricator.wikimedia.org/T254226 to track the process/work on it." [puppet] - https://gerrit.wikimedia.org/r/595167 (owner: Alexandros Kosiaris)
[09:23:55] (PS3) Alexandros Kosiaris: ores: Parameterize redis ports [puppet] - https://gerrit.wikimedia.org/r/595167 (https://phabricator.wikimedia.org/T254226)
[09:25:03] Operations, ORES, Scoring-platform-team, Patch-For-Review: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (akosiaris)
[09:29:05] (CR) Dzahn: "Just attempted the reimage of contint1001 but it failed because "raid1-4dev.cfg" does not exist. First I thought that had been deleted but" [puppet] - https://gerrit.wikimedia.org/r/571265 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[09:29:31] (CR) Dzahn: "follow-up to https://gerrit.wikimedia.org/r/c/operations/puppet/+/571265" [puppet] - https://gerrit.wikimedia.org/r/601666 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[09:29:33] (CR) Dzahn: [C: +2] partman: switch contint1001 to raid10-4dev, previous recipe is gone [puppet] - https://gerrit.wikimedia.org/r/601666 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[09:33:16] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Script wmf-auto-reimage wa...
[09:33:19] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Completed auto-reimage of...
[09:38:05] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Script wmf-auto-reimage wa...
[09:38:09] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Completed auto-reimage of...
[09:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1148 to dbctl depooled T252512', diff saved to https://phabricator.wikimedia.org/P11356 and previous config saved to /var/cache/conftool/dbconfig/20200602-093841-marostegui.json
[09:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:45] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[09:39:15] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Script wmf-auto-reimage wa...
[09:42:01] (PS1) Marostegui: db1148: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/601670 (https://phabricator.wikimedia.org/T252512)
[09:43:37] (CR) Marostegui: [C: +2] db1148: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/601670 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui)
[09:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11357 and previous config saved to /var/cache/conftool/dbconfig/20200602-094441-marostegui.json
[09:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:45] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[09:45:07] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Completed auto-reimage of...
[09:45:30] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (ops-monitoring-bot) Script wmf-auto-reimage wa...
[09:45:46] (CR) Alexandros Kosiaris: [C: +2] Switch oresrdb.svc records to redis::misc [dns] - https://gerrit.wikimedia.org/r/601665 (https://phabricator.wikimedia.org/T254226) (owner: Alexandros Kosiaris)
[09:46:59] (CR) Alexandros Kosiaris: [C: +2] ores: Parameterize redis ports [puppet] - https://gerrit.wikimedia.org/r/595167 (https://phabricator.wikimedia.org/T254226) (owner: Alexandros Kosiaris)
[09:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P11358 and previous config saved to /var/cache/conftool/dbconfig/20200602-094914-marostegui.json
[09:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11359 and previous config saved to /var/cache/conftool/dbconfig/20200602-095321-marostegui.json
[09:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:26] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[09:58:49] (PS1) Filippo Giunchedi: icinga: delete unreferenced contact groups [puppet] - https://gerrit.wikimedia.org/r/601672 (https://phabricator.wikimedia.org/T254006)
[10:02:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11360 and previous config saved to /var/cache/conftool/dbconfig/20200602-100246-marostegui.json
[10:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:50] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[10:07:15] Operations, Analytics, Event-Platform, Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (hnowlan) I'm okay with using `general.yaml` in this way but I would like to put the kafka service list under a general hierarchy of services (like `"service...
[10:08:45] Operations, Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (jcrespo) I have to temporarily block gerrit-root members access to gerrit2001 as I need to use the global key to decrypt backups to a different host than they were taken AND these members have local ro...
[10:09:25] !log switch over ores2XXX hosts to redis::misc from oresrdb hosts. T254226
[10:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:29] T254226: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226
[10:09:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[10:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:49] Operations, Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (Dzahn) I have already tried restoring a single random file from the git-lfs directory on gerrit1001 itself and the file is here now: ` root@gerrit1001:/var/tmp/bacula-restores/srv/gerrit/plugins/lfs/...
[10:11:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1121, db1148 T252512', diff saved to https://phabricator.wikimedia.org/P11361 and previous config saved to /var/cache/conftool/dbconfig/20200602-101150-marostegui.json
[10:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:54] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[10:12:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:08] !log disable non-global root login to gerrit2001 T254162
[10:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:11] T254162: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162
[10:13:30] Operations, Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (Dzahn) >>! In T254162#6184364, @jcrespo wrote: > I have to temporarily block gerrit-root members access to gerrit2001 as I need to use the global key to decrypt backups to a different host than they we...
[10:21:48] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:23:50] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (Dzahn) Thanks @YuJuLu I found it. You will go by "lulu" since that is the UID though. Can you let me know your first and last name please? I could...
[10:26:25] Operations, ORES, Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (akosiaris) codfw migration has gone really well, I 've barely managed to notice the migration in the dashboards.
[10:29:10] !log switch over ores1XXX hosts to redis::misc from oresrdb hosts. T254226
[10:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:14] T254226: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226
[10:29:27] (PS1) Dzahn: admin: add Yi-Ju Lu to ldap_only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/601685 (https://phabricator.wikimedia.org/T254121)
[10:31:13] Operations, Jupyter-Hub, LDAP-Access-Requests, Patch-For-Review: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (YiJuLu) @Dzahn Yes, "Yi-Ju Lu" is correct as my name.
[10:33:22] Operations, Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (jcrespo) I have scheduled the restore. If this was an emergency, I would kill ongoing backups jobs and the restore would run immediately, but because this is a test, I would let the large ongoing backu...
[10:36:49] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:39:57] Operations, Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (jcrespo) >>! In T254162#6184387, @Dzahn wrote: > Would this be sufficient? Normally yes, but I would like to do a full restore to a separate server (simulating a total loss of the primary server), as...
[10:40:32] ACKNOWLEDGEMENT - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn https://phabricator.wikimedia.org/T254025 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:43:01] RECOVERY - mysqld processes on db2110 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:44:17] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/601685 (https://phabricator.wikimedia.org/T254121) (owner: Dzahn)
[10:46:18] (CR) Dzahn: [C: +2] admin: add Yi-Ju Lu to ldap_only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/601685 (https://phabricator.wikimedia.org/T254121) (owner: Dzahn)
[10:48:37] !log LDAP - added uid=lulu to group nda (T254121)
[10:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:41] T254121: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121
[10:51:26] Operations, Jupyter-Hub, LDAP-Access-Requests, Patch-For-Review: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (Dzahn) @YiJuLu Thanks for confirming. You have been added to the LDAP group "nda". This was one step needed to give you access...
[10:52:10] Operations, Jupyter-Hub, LDAP-Access-Requests, Patch-For-Review: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (Dzahn) This will continue on T254130
[10:53:00] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (Dzahn) T254121 has been resolved now. That was one pre-requisite for getting access to JupyterHub (and other things). We still need the things Aklapper mentioned above though.
[10:56:05] 10Operations, 10ORES, 10Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (10akosiaris) eqiad migrations has gone pretty much as well. There seem to be some occasional overloads due to ores1001 at some point, it looks like a restart of uwsgi+celery fixed it. [10:57:16] (03PS3) 10Hnowlan: changeprop-jobqueue: enable all high-traffic jobs, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/599383 (https://phabricator.wikimedia.org/T220399) [10:57:33] (03PS2) 10KartikMistry: Create URL campaign for African languages for COVID-19 translation project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601174 (https://phabricator.wikimedia.org/T253305) [10:58:00] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop-jobqueue: enable all high-traffic jobs, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/599383 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:58:25] (03PS2) 10Muehlenhoff: Enable managed adduser config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/599358 (https://phabricator.wikimedia.org/T235162) [10:58:30] (03Merged) 10jenkins-bot: changeprop-jobqueue: enable all high-traffic jobs, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/599383 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1100). [11:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:38] * kart_ is here and will deploy config patch.. [11:01:28] (03CR) 10KartikMistry: [C: 03+2] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/601174 (https://phabricator.wikimedia.org/T253305) (owner: 10KartikMistry) [11:01:53] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [11:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:06] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:02:19] (03Merged) 10jenkins-bot: Create URL campaign for African languages for COVID-19 translation project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601174 (https://phabricator.wikimedia.org/T253305) (owner: 10KartikMistry) [11:04:21] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10hnowlan) Rebootstrapping now. [11:04:37] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: renew TLS cert for the k8s API [puppet] - 10https://gerrit.wikimedia.org/r/601692 (https://phabricator.wikimedia.org/T250874) [11:05:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: prometheus: renew TLS cert for the k8s API [puppet] - 10https://gerrit.wikimedia.org/r/601692 (https://phabricator.wikimedia.org/T250874) (owner: 10Arturo Borrero Gonzalez) [11:07:53] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|601174|Create URL campaign for African languages for COVID-19 translation project (T253305)]] (duration: 01m 00s) [11:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:57] T253305: Create URL campaign for African languages for COVID-19 translation project - https://phabricator.wikimedia.org/T253305 [11:08:17] (03CR) 10Muehlenhoff: [C: 03+2] Enable managed adduser config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/599358 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [11:08:28] PROBLEM - 
Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:08:36] No other patches in EU SWAT. [11:08:41] !log contint1001 - common issue after reinstalls again - a2dismod mpm_event ; systemctl restart apache2 ; puppet agent -tv ( T196968) https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206 [11:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:44] T196968: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 [11:09:23] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10ops-monitoring-bot) Completed auto-reimage of... [11:10:20] !log Finished EU Mid-day SWAT. [11:10:20] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) @hashar contint1001 is now on buster [11:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:34] 10Operations, 10Wikimedia-Logstash, 10observability: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10Dzahn) logtash2028 is reporting as failed SSH since 2 days. There is noting in SAL or an open ticket. Notifications are disabled but that could be from previous reinsta... 
[11:13:45] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat I wandered off to lunch in the middle. https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:14:06] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:17:34] 10Operations, 10ORES, 10Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (10akosiaris) 05Open→03Stalled https://grafana.wikimedia.org/d/RLhtAw6mz/ores-redis?orgId=1&refresh=1m has been updated as well. I am gonna call this resolved, but we should wait a c... [11:23:45] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (10Dzahn) 05Open→03Resolved a:03Dzahn [11:23:48] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Intern (@YiJuLu) - https://phabricator.wikimedia.org/T254120 (10Dzahn) [11:24:38] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01012 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:24:48] (03PS2) 10Alexandros Kosiaris: ci: Add kubeyaml [puppet] - 10https://gerrit.wikimedia.org/r/601376 [11:24:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [11:25:24] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) [11:25:59] (03Merged) 10jenkins-bot: 
Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [11:26:02] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) a:05Dzahn→03hashar re-assigning for... [11:26:58] (03CR) 10Dzahn: [C: 03+2] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [11:28:48] moritzm: I think the puppet failures are related to your change to File[/etc/sysusers.d/sysusers-base.conf] [11:30:21] change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory [11:32:37] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) Hi @Prtksxna ah, that explains it. Thanks. Could we have _only_ the contents of _site in the repo to make it a "deploy... [11:33:30] moritzm: so, it seems that on 29 hosts /etc/sysusers.d/ doesn't exists [11:33:49] (in codfw) [11:34:22] moritzm: list at https://etherpad.wikimedia.org/p/volans-tmp2 [11:34:55] (all OS versions included) [11:35:05] looking [11:35:49] there is some buster/stretch/jessie in the mix, so is not OS dependent [11:37:23] (03PS1) 10Muehlenhoff: Revert "Enable managed adduser config for codfw" [puppet] - 10https://gerrit.wikimedia.org/r/601698 [11:37:49] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Prtksxna) @Dzahn, I can see how it would make things easier, but I am not sure it would be maintainable to have the source an... 
[11:38:30] I'll revert for now, this worked fine when enabling in ulsfo, puppet should have handled that transparently, but I can also explicitly add the directory in puppet [11:38:45] who creates the directory normally? [11:38:54] some postinst? [11:38:56] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Enable managed adduser config for codfw" [puppet] - 10https://gerrit.wikimedia.org/r/601698 (owner: 10Muehlenhoff) [11:40:32] can't find it with dpkg -S [11:40:48] the systemd package only creates /usr/lib/sysusers.d, not /etc/sysusers.d [11:41:11] system users shipped by packages are expected to only use the distro path in /usr [11:41:23] so why that dir is created on so many hosts but not all? [11:41:58] that is the question I'm trying to figure out :-) [11:42:29] :) [11:42:43] moritzm: e050204896aa4c11101dc9b2d5a9ab0a3b5fcef0 [11:42:57] in the puppet repo [11:44:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] ganeti: add monitoring for ganeti RAPI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [11:44:49] interestingly that code is still there, so maybe an order issue [11:45:00] adding a require => File['/etc/sysusers.d'] should solve it [11:45:18] the issue is that we have ~ 60 hosts which don't use the systemd class ATM [11:45:25] ah [11:45:27] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) >>! In T254118#6184641, @Prtksxna wrote: > @Dzahn, I can see how it would make things easier, but I am not sure it wou... [11:45:29] stashbot is not working due to issues connecting to toolforge internal elasticsearch server, I'm investigating [11:45:44] so e.g. 
apt2001 apparently never declares any resource using systemd [11:47:00] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:47:10] I'll prepare a fix [11:47:16] ack, thanks! [11:47:24] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) >>! In T254118#6184641, @Prtksxna wrote: > I am not sure it would be maintainable to have the source and the built si... [11:49:58] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:53:18] (03CR) 10CDanis: [C: 03+1] icinga: delete unreferenced contact groups [puppet] - 10https://gerrit.wikimedia.org/r/601672 (https://phabricator.wikimedia.org/T254006) (owner: 10Filippo Giunchedi) [11:55:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:56:42] (03PS1) 10Muehlenhoff: Make systemd::sysuser require systemd class [puppet] - 10https://gerrit.wikimedia.org/r/601703 (https://phabricator.wikimedia.org/T235162) [11:56:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci: Add kubeyaml [puppet] - 10https://gerrit.wikimedia.org/r/601376 (owner: 10Alexandros Kosiaris) [12:01:35] (03CR) 10Dzahn: ganeti: add monitoring for ganeti RAPI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [12:02:14] (03PS4) 10Dzahn: ganeti: add monitoring for ganeti RAPI [puppet] - 10https://gerrit.wikimedia.org/r/589608 [12:02:44] RECOVERY - 
Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005693 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:03:08] (03PS2) 10Dzahn: site: decom mw2173 through mw2179 [puppet] - 10https://gerrit.wikimedia.org/r/599603 (https://phabricator.wikimedia.org/T247018) [12:06:15] (03PS1) 10Kormat: mariadb: Enable notifications for db2140 [puppet] - 10https://gerrit.wikimedia.org/r/601704 (https://phabricator.wikimedia.org/T252985) [12:07:44] (03PS5) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [12:10:04] (03PS1) 10Alexandros Kosiaris: Revert "ci: Add kubeyaml" [puppet] - 10https://gerrit.wikimedia.org/r/601705 [12:10:26] stashbot should be back shortly [12:11:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "ci: Add kubeyaml" [puppet] - 10https://gerrit.wikimedia.org/r/601705 (owner: 10Alexandros Kosiaris) [12:12:00] (03CR) 10Muehlenhoff: partman: switch contint1001 to raid10-4dev, previous recipe is gone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601666 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:13:52] (03PS2) 10Alexandros Kosiaris: Revert "ci: Add kubeyaml" [puppet] - 10https://gerrit.wikimedia.org/r/601705 [12:14:54] (03CR) 10Dzahn: "I did not expect we'd be telling partman to use recipes that don't exist yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/601666 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:16:52] (03Abandoned) 10Alexandros Kosiaris: Revert "ci: Add kubeyaml" [puppet] - 10https://gerrit.wikimedia.org/r/601705 (owner: 10Alexandros Kosiaris) [12:16:59] (03Restored) 10Alexandros Kosiaris: Revert "ci: Add kubeyaml" [puppet] - 10https://gerrit.wikimedia.org/r/601705 (owner: 10Alexandros Kosiaris) [12:17:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "ci: Add kubeyaml" [puppet] - 10https://gerrit.wikimedia.org/r/601705 (owner: 10Alexandros Kosiaris) [12:17:08] (03CR) 10Dzahn: "At second look it seemed like a simple typo (1 instead of 10). That being said, the current status with RAID10 seems fine to me. We have t" [puppet] - 10https://gerrit.wikimedia.org/r/601666 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:17:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Again, actual change that is going to work in https://gerrit.wikimedia.org/r/#/c/integration/config/+/601707" [puppet] - 10https://gerrit.wikimedia.org/r/601705 (owner: 10Alexandros Kosiaris) [12:18:19] (03PS1) 10Muehlenhoff: Add raid1-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/601708 (https://phabricator.wikimedia.org/T156955) [12:18:28] (03PS3) 10Dzahn: site: decom mw2173 through mw2179 [puppet] - 10https://gerrit.wikimedia.org/r/599603 (https://phabricator.wikimedia.org/T247018) [12:25:50] (03PS1) 10Andrew Bogott: wmcs resolv.conf: reduce timeout to 1s [puppet] - 10https://gerrit.wikimedia.org/r/601711 (https://phabricator.wikimedia.org/T253780) [12:26:52] (03PS2) 10Alexandros Kosiaris: Package multiple charts for egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/598187 (https://phabricator.wikimedia.org/T249927) [12:27:32] (03PS1) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 [12:28:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Package multiple charts for egress networkpolicy 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598187 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [12:28:27] (03Merged) 10jenkins-bot: Package multiple charts for egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/598187 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [12:28:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw217[3-9].codfw.wmnet [12:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:43] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (owner: 10Jbond) [12:29:14] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.039 second response time on 10.192.48.54 port 9042 https://phabricator.wikimedia.org/T93886 [12:30:04] (03PS2) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 [12:30:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool db2110, copy to db2140 complete T252985', diff saved to https://phabricator.wikimedia.org/P11362 and previous config saved to /var/cache/conftool/dbconfig/20200602-123020-kormat.json [12:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:25] T252985: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 [12:31:14] (03CR) 10Filippo Giunchedi: "See inline for wrong devcount, although I'm not sure in what case(s) a four-way mirrored array that spans all the disk would be useful" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601708 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:31:22] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (owner: 10Jbond) [12:31:53] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw217[3-9].codfw.wmnet [12:31:55] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:37] (03PS3) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 [12:33:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs resolv.conf: reduce timeout to 1s [puppet] - 10https://gerrit.wikimedia.org/r/601711 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [12:33:47] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (owner: 10Jbond) [12:35:53] (03PS4) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 [12:36:24] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:36:28] (03PS2) 10Andrew Bogott: wmcs resolv.conf: reduce timeout to 1s [puppet] - 10https://gerrit.wikimedia.org/r/601711 (https://phabricator.wikimedia.org/T253780) [12:36:30] (03PS1) 10Andrew Bogott: wmcs vms: stop using ns1 for resolving [puppet] - 10https://gerrit.wikimedia.org/r/601714 (https://phabricator.wikimedia.org/T253780) [12:38:06] (03CR) 10Marostegui: [C: 03+1] mariadb: Enable notifications for db2140 [puppet] - 10https://gerrit.wikimedia.org/r/601704 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:39:35] (03PS1) 10Marostegui: install_server: Reimage labsdb1011 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/601715 (https://phabricator.wikimedia.org/T249188) [12:39:39] (03PS5) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) [12:39:46] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[12:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:54] (03CR) 10Kormat: [C: 03+2] mariadb: Enable notifications for db2140 [puppet] - 10https://gerrit.wikimedia.org/r/601704 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:41:06] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [12:41:28] PROBLEM - mediawiki-installation DSH group on mw2173 is CRITICAL: Host mw2173 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:41:57] (03CR) 10Kormat: [C: 03+1] install_server: Reimage labsdb1011 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/601715 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [12:43:12] PROBLEM - mediawiki-installation DSH group on mw2174 is CRITICAL: Host mw2174 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:43:20] akosiaris: is it safe to merge your puppet CR? [12:43:27] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage labsdb1011 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/601715 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [12:43:33] kormat: you can also merge mine .) [12:43:34] :) [12:43:45] (03Abandoned) 10Muehlenhoff: Add raid1-4dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/601708 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:43:47] 🏃‍♂️ [12:43:51] kormat: the revert? yes fully [12:43:52] thanks! 
[12:44:30] RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:44:32] np, done [12:44:54] <3 [12:45:28] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2173 is CRITICAL: Host mw2173 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:45:28] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2174 is CRITICAL: Host mw2174 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:45:34] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:46:36] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) Any update since last month? Q1 starts in one month and this task would need some preparation and some scheduling ahead. Q1 is also summer vac... [12:49:22] 10Operations, 10Acme-chief, 10Traffic: Provide ensure => absent support for acme_chief::cert define - https://phabricator.wikimedia.org/T229097 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez This has been automagically solved with 725e7f4eeb37a3742591a3f7357b6862e3b4c361, moving OCSP stapling to the a... [12:49:25] 10Operations, 10Traffic: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (10Vgutierrez) [12:49:33] jouncebot: refresh [12:49:34] I refreshed my knowledge about deployments. 
[12:49:40] jouncebot: next [12:49:41] In 1 hour(s) and 10 minute(s): database load testing (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1400) [12:50:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db2140 into s4 T252985', diff saved to https://phabricator.wikimedia.org/P11363 and previous config saved to /var/cache/conftool/dbconfig/20200602-125012-kormat.json [12:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:16] T252985: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 [12:50:27] jouncebot: refresh [12:50:27] I refreshed my knowledge about deployments. [12:50:28] jouncebot: next [12:50:29] In 0 hour(s) and 9 minute(s): database load testing (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1300) [12:50:30] (03CR) 10Marostegui: [C: 03+1] limit per-user Special:Contributions concurrency to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601361 (https://phabricator.wikimedia.org/T234450) (owner: 10CDanis) [12:52:22] (03PS6) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) [12:52:39] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) Per our chat on IRC just now. It's not a problem. We can do it like we did for the TransparencyReport. I see there... 
[12:52:48] (03PS1) 10Alexandros Kosiaris: beta: Allow using docker volumes [puppet] - 10https://gerrit.wikimedia.org/r/601717 (https://phabricator.wikimedia.org/T251176) [12:52:50] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [12:53:04] PROBLEM - mediawiki-installation DSH group on mw2175 is CRITICAL: Host mw2175 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:53:15] (03CR) 10CDanis: [C: 03+2] limit per-user Special:Contributions concurrency to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601361 (https://phabricator.wikimedia.org/T234450) (owner: 10CDanis) [12:53:33] 10Operations, 10Traffic: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (10Vgutierrez) [12:53:59] 10Operations, 10Traffic: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (10Vgutierrez) p:05Triage→03Medium [12:54:05] (03Merged) 10jenkins-bot: limit per-user Special:Contributions concurrency to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601361 (https://phabricator.wikimedia.org/T234450) (owner: 10CDanis) [12:56:00] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601703 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:56:04] !log cdanis@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: 5debc3223 limit per-user Special:Contributions concurrency to 2 T234450 (duration: 00m 58s) [12:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] 10Operations, 10ORES, 10Scoring-platform-team: Move ORES to redis misc cluster - https://phabricator.wikimedia.org/T254226 (10akosiaris) [12:58:30] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10akosiaris) [12:59:10] 10Operations, 10Traffic, 10serviceops, 
10Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Vgutierrez) can we close this task or at least change the task title to focus on the icinga alerts? there is no issue with cert renewal itself :) ` w... [12:59:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [12:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:22] (03CR) 10Muehlenhoff: Make systemd::sysuser require systemd class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601703 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [12:59:30] (03PS2) 10Muehlenhoff: Make systemd::sysuser require systemd class [puppet] - 10https://gerrit.wikimedia.org/r/601703 (https://phabricator.wikimedia.org/T235162) [12:59:31] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet,oresrdb - https://phabricator.wikimedia.org/T254240 (10akosiaris) [12:59:59] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] cdanis and marostegui: I, the Bot under the Fountain, allow thee, The Deployer, to do database load testing deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1300). 
[13:00:40] PROBLEM - mediawiki-installation DSH group on mw2176 is CRITICAL: Host mw2176 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:00:41] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10akosiaris) p:05Triage→03Medium a:05wiki_willy→03None [13:00:56] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet,oresrdb - https://phabricator.wikimedia.org/T254240 (10akosiaris) a:05wiki_willy→03None [13:01:03] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10akosiaris) [13:01:21] (03PS7) 10Jbond: puppetmaster: update puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) [13:03:06] !log cdanis@deploy1001 Synchronized php-1.35.0-wmf.34/includes/specials/pagers/ContribsPager.php: revert contribs limit to 5000 T234450 (duration: 00m 58s) [13:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:39] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Prtksxna) >>! In T254118#6184852, @Dzahn wrote: > Just one unrelated thing needs to be checked.. Gerrit permissions to make s... 
[13:04:36] (03PS1) 10Muehlenhoff: Add library hints for pango [puppet] - 10https://gerrit.wikimedia.org/r/601724 [13:05:14] !log cdanis@deploy1001 Synchronized php-1.35.0-wmf.31/includes/specials/pagers/ContribsPager.php: revert contribs limit to 5000 T234450 (duration: 00m 57s) [13:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:21] (03CR) 10Jbond: "PCC OPS: https://puppet-compiler.wmflabs.org/compiler1003/22927/" [puppet] - 10https://gerrit.wikimedia.org/r/601712 (https://phabricator.wikimedia.org/T251104) (owner: 10Jbond) [13:06:30] (03CR) 10Muehlenhoff: [C: 03+2] Add library hints for pango [puppet] - 10https://gerrit.wikimedia.org/r/601724 (owner: 10Muehlenhoff) [13:07:30] (03CR) 10Jbond: [C: 03+1] Make systemd::sysuser require systemd class [puppet] - 10https://gerrit.wikimedia.org/r/601703 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [13:08:40] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) Yes, it should be renamed. But i think it is traffic team's decision what to do about the monitoring per this being the " primary automated mon... [13:09:19] 10Operations, 10netops: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (10ayounsi) > Please find the below KB for TOE chip memory errors reported on routers with MPC-3D-16XGE-SFPP FPCs. > https://kb.juniper.net/InfoCenter/index?page=content&id=KB31235 > These messages could indicate... [13:12:19] (03CR) 10Dzahn: "I was about to say that RAID 10 with 4 disks works for this and we have enough space. so +1 to abandon it." 
[puppet] - 10https://gerrit.wikimedia.org/r/601708 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:13:38] (03CR) 10Muehlenhoff: [C: 03+2] Make systemd::sysuser require systemd class [puppet] - 10https://gerrit.wikimedia.org/r/601703 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [13:16:33] (03PS1) 10Ssingh: dnsdist: add parameters for TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) [13:17:50] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2175 is CRITICAL: Host mw2175 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:17:50] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2176 is CRITICAL: Host mw2176 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:17:50] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2179 is CRITICAL: Host mw2179 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:18:12] (03CR) 10Dzahn: [C: 03+2] site: decom mw2173 through mw2179 [puppet] - 10https://gerrit.wikimedia.org/r/599603 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [13:18:15] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: add parameters for TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:18:24] (03PS4) 10Dzahn: site: decom mw2173 through mw2179 [puppet] - 10https://gerrit.wikimedia.org/r/599603 (https://phabricator.wikimedia.org/T247018) [13:18:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:18:35] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [13:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:42] mutante: wait [13:18:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:18:52] I have still those 2 patches to merge [13:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:57] volans: ok! [13:18:59] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [13:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:04] * mutante hits ctrl+c [13:19:40] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: check repositories [cookbooks] - 10https://gerrit.wikimedia.org/r/598065 (owner: 10Volans) [13:19:48] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: use new spicerack.actions [cookbooks] - 10https://gerrit.wikimedia.org/r/598153 (owner: 10Volans) [13:19:53] volans: yes, i have about 14. 7 per change [13:20:02] lol [13:20:13] hopefully all works fine at the first [13:20:15] (03PS2) 10Ssingh: dnsdist: add parameters for TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) [13:20:31] i am doing max 5 because i don't want to --force it [13:20:35] sorry for having stopped you, I already missed few occasions in the last days [13:20:43] no problem. 
i am making a coffee [13:20:55] thx, I'll ping once deployed in few [13:20:58] bbiaw [13:21:03] ack, cool [13:21:52] (03Merged) 10jenkins-bot: sre.hosts.decommission: check repositories [cookbooks] - 10https://gerrit.wikimedia.org/r/598065 (owner: 10Volans) [13:21:59] (03Merged) 10jenkins-bot: sre.hosts.decommission: use new spicerack.actions [cookbooks] - 10https://gerrit.wikimedia.org/r/598153 (owner: 10Volans) [13:24:12] mutante: cookbooks deployed, I'm around in case of any issue [13:26:18] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22929/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:28:09] volans: ack, starting with a single one [13:28:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:26] great [13:28:36] Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway? [13:28:40] Type "done" to proceed [13:28:46] hah, nice volans! [13:28:51] (03PS1) 10Muehlenhoff: Enable managed adduser config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/601730 (https://phabricator.wikimedia.org/T235162) [13:28:54] false positive or real one? [13:28:54] best test of that new feature :) [13:28:58] looking [13:28:58] yeah [13:29:08] PROBLEM - mediawiki-installation DSH group on mw2178 is CRITICAL: Host mw2178 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:30:11] volans: it's real.. but you know why.. i am following the ServerLifecycle that says to first run decom cookbook and then remove from site.pp [13:31:00] what we wanted was "in other places besides site.pp" or change the order .. 
i guess [13:31:11] let me see [13:31:24] well, but you need to handle the service specific removal steps before running the decom cookbook in any case# [13:31:56] yea, i did "depool=inactive" before [13:32:13] but i did not do "remove from site.pp, conftool and DHCP" before running it [13:32:34] what we wanted it to catch was stuff like "it is also an mcrouter proxy" [13:32:43] yeah, I see [13:32:48] how we want to improve it? :) [13:33:13] if i remove it from site.pp first and then run the decom script.. will it be a problem? [13:33:21] because it wont find it in puppet db anymore, right? [13:33:44] will still be in puppetdb but puppet should start failing [13:33:53] because no matches in site.pp [13:34:11] and we should not remove dhcp before the decom I'd say [13:34:30] that's not a big deal.. the flip-side is that we won't be getting the Icinga alerts "not in dsh group" if i depool but don't remove from site [13:35:04] i don't think DHCP is very important.. it's not like dcops use it to boot into disk wiping software or something [13:35:05] you can anyway override the check and proceed anyway [13:35:19] maybe is just a good reminder of what's left [13:35:23] to remove after it :) [13:35:23] we would really only need to revert DHCP if we change our mind and want to reinstall [13:35:38] (03CR) 10Muehlenhoff: [C: 03+2] Enable managed adduser config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/601730 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [13:35:58] * mutante types "done" to override it [13:37:14] i dunno, the other option seems to be to say "site.pp and DHCP are excluded from the check" [13:37:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:32] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers 
- https://phabricator.wikimedia.org/T247018 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw217... [13:38:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:26] PROBLEM - mediawiki-installation DSH group on mw2177 is CRITICAL: Host mw2177 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:38:39] (03PS1) 10Ottomata: EventLogging - use EventGate on group0 wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601732 (https://phabricator.wikimedia.org/T249261) [13:39:11] volans: i like how it shows me the matches though.. so i can glance at it and see "ah yea, only expected things". that is already useful to catch stuff [13:39:35] ok, great, thanks for the testing and feedback [13:41:26] (03CR) 10Jbond: [C: 03+1] "LGTM some optional comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:41:28] yep, i see no other issues. ran it on 4 at once this time. [13:42:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:10] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21... [13:42:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:17] volans: actually..it is only about DHCP and conftool-data, not about site.pp. 
I don't think it finds them in site.pp because that would mean expanding all the regexes [13:43:41] yeah matching regexs with a regex is not easy ;) [13:43:47] yea.. [13:45:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:19] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[21... [13:48:23] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Jclark-ctr) @Marostegui Replaced failed DIMM. host is powered back on [13:49:16] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) Thank you, I will take it from here [13:53:34] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01017 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:54:01] (03CR) 10Dzahn: "I agree with Jbond that making the ciphers an array is a bit nicer but also that doing an entire type is overkill." [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:54:03] ^ the puppet alerts are known [13:54:56] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 94 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:55:40] jynus: ^ this one normal too because of the restore test or something? it was actually ok earlier [13:55:53] as in .. 
i saw it recover after the first full one was done [13:56:34] mutante: at the start of the month there is a bit of overload [13:56:44] because all full backups run the first week [13:56:51] so there could be some delays [13:57:07] the restore may not help by adding more scheduled jobs [13:57:27] jynus: ah, restore being a job makes sense.. ack [13:57:29] the main issue if the phabricator backup [13:57:51] that takes a long time as it has every git repo too [13:58:04] that is why I didn't want to cancel it [13:58:15] as it would have to start from the beginning [13:58:30] i saw that comment. also thanks for the whole "check whether it can actually be restored" comment [13:58:41] yea, don't cancel it. ack [13:58:59] given it is a non prio test, I prefer to continue with the normal schedule [13:59:10] now that we have more hardware we could tune concurrency [13:59:30] but it is one of those things that to try to make things faster they may end up slower [13:59:36] (03PS1) 10Jbond: adduser: manage /etc/sysusers.d [puppet] - 10https://gerrit.wikimedia.org/r/601738 [13:59:47] yep! makes sense to me. thank you. gotta go to a meeting now [14:05:27] (03CR) 10Jbond: "pcc https://puppet-compiler.wmflabs.org/compiler1003/22930/" [puppet] - 10https://gerrit.wikimedia.org/r/601738 (owner: 10Jbond) [14:06:05] (03PS6) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [14:08:48] (03PS1) 10Muehlenhoff: Revert "Enable managed adduser config for codfw" [puppet] - 10https://gerrit.wikimedia.org/r/601739 [14:09:52] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[14:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:04] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Enable managed adduser config for codfw" [puppet] - 10https://gerrit.wikimedia.org/r/601739 (owner: 10Muehlenhoff) [14:13:22] (03PS1) 10Alexandros Kosiaris: ganeti: codfw+eqiad: Reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/601740 [14:15:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: codfw+eqiad: Reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/601740 (owner: 10Alexandros Kosiaris) [14:17:29] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1140.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200... [14:19:01] (03Abandoned) 10Jbond: adduser: manage /etc/sysusers.d [puppet] - 10https://gerrit.wikimedia.org/r/601738 (owner: 10Jbond) [14:20:43] (03CR) 10Ssingh: "Thank you both for your review:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:23:07] 10Puppet, 10User-jbond: Upgrade puppet to use hiera version 5 - https://phabricator.wikimedia.org/T254248 (10jbond) p:05Triage→03Medium [14:23:24] (03PS55) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 (https://phabricator.wikimedia.org/T254248) [14:25:44] (03PS3) 10Jbond: java: update java.security [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) [14:27:04] 10Operations, 10Patch-For-Review, 10User-MoritzMuehlenhoff, 10User-jbond: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10jbond) [14:28:09] (03CR) 10Cwhite: [C: 03+1] icinga: delete unreferenced contact groups [puppet] - 10https://gerrit.wikimedia.org/r/601672
(https://phabricator.wikimedia.org/T254006) (owner: 10Filippo Giunchedi) [14:28:19] (03Abandoned) 10Jbond: add abuse_networks [labs/private] - 10https://gerrit.wikimedia.org/r/583327 (owner: 10Jbond) [14:28:41] (03PS1) 10Ayounsi: Depool codfw for network work [dns] - 10https://gerrit.wikimedia.org/r/601741 (https://phabricator.wikimedia.org/T254216) [14:29:16] (03CR) 10CDanis: [C: 03+1] Depool codfw for network work [dns] - 10https://gerrit.wikimedia.org/r/601741 (https://phabricator.wikimedia.org/T254216) (owner: 10Ayounsi) [14:29:38] (03CR) 10Vgutierrez: [C: 03+1] Depool codfw for network work [dns] - 10https://gerrit.wikimedia.org/r/601741 (https://phabricator.wikimedia.org/T254216) (owner: 10Ayounsi) [14:30:00] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10User-jbond: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (10jbond) [14:30:14] (03PS1) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [14:30:46] PROBLEM - Host thumbor1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:49] (03CR) 10Ayounsi: [C: 03+2] Depool codfw for network work [dns] - 10https://gerrit.wikimedia.org/r/601741 (https://phabricator.wikimedia.org/T254216) (owner: 10Ayounsi) [14:30:54] (03PS1) 10Jcrespo: install_server: Update NIC hw address for db1140 [puppet] - 10https://gerrit.wikimedia.org/r/601744 (https://phabricator.wikimedia.org/T250602) [14:31:04] PROBLEM - Host thumbor1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:32] !log depool codfw - T254216 [14:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:36] T254216: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 [14:31:41] (03CR) 10jerkins-bot: [V: 04-1] Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 
(https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [14:32:32] PROBLEM - Host thumbor1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:33:37] uh [14:33:54] thumbor problems known? [14:34:15] hw maintenance? [14:34:29] (03PS1) 10Hnowlan: changeprop: disable config value quoting in jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/601745 [14:35:20] PROBLEM - Host thumbor1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:35:21] (03CR) 10Jcrespo: [C: 03+2] install_server: Update NIC hw address for db1140 [puppet] - 10https://gerrit.wikimedia.org/r/601744 (https://phabricator.wikimedia.org/T250602) (owner: 10Jcrespo) [14:35:38] (03CR) 10Ppchelko: [C: 03+2] changeprop: disable config value quoting in jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/601745 (owner: 10Hnowlan) [14:36:10] (03Merged) 10jenkins-bot: changeprop: disable config value quoting in jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/601745 (owner: 10Hnowlan) [14:36:12] (03PS11) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 [14:36:42] (03CR) 10Jbond: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [14:37:04] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.004453 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:37:09] thumbor100[34] are both in rack D5, I think we're decoming a bunch of the appservers in that rack -- mutante to confirm [14:37:10] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [14:37:17] 1003 and 1004 are both in D5 [14:37:24] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [14:37:28] and can't connect to neither mgmt or the actual servers [14:37:33] 10Puppet, 10User-jbond: Refactor puppet-merge - 
https://phabricator.wikimedia.org/T254249 (10jbond) p:05Triage→03Medium [14:37:51] (03PS12) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [14:38:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1140.eqiad.wmnet'] ` The log can be found in `/var/log/... [14:38:59] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [14:39:00] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [14:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:44] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.55 port 9042 https://phabricator.wikimedia.org/T93886 [14:39:49] yeah, confirming, all the mw hosts in that rack are in state DECOM according to netbox, but those thumbor hosts are not [14:40:04] is dcops onsite in eqiad today? [14:40:22] (03PS13) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [14:41:02] db1137 is also in D5 and hasn't alerted afaict, so we can probably rule out rack network/power issues [14:41:13] cmjohnson1 owns https://phabricator.wikimedia.org/T253856 [14:42:18] (03PS3) 10Apakhomov: eventgate: added support egress rules eventgate: Deleted networkpolicy field from values-canary.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/597772 [14:43:25] cmjohnson1: ^ thumbor1003/1004 are unreachable, side effect of mw decoms?
[14:43:45] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [14:43:46] contact jclark-ctr he is onsite right now [14:43:56] I pinged in #-dcops [14:44:27] I will be there shortly [14:44:52] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/601436 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [14:45:06] (03PS2) 10Apakhomov: eventstreams: added support egress rules eventstreams: deleted networkpolicy field from values-canary.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/597774 [14:45:09] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [14:47:02] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm) [14:47:08] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: name=thumbor100[34].* [14:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:27] cdanis: thanks, was just about to ask you to do that -- I'm having trouble sshing [14:47:52] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [14:48:18] rzl: are you trying to use codfw per chance [14:49:03] !log prefer eqsin-ulsfo tunnel - T254216 [14:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] T254216: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 [14:49:35] cdanis: I was, but connecting direct to bast1002 didn't work either [14:49:43] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [14:49:46] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:01] oh, yes it did, just took longer than I expected [14:50:06] rzl: connection timeout? or other? also please get mtrs [14:50:07] ah okay [14:50:19] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:24] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] (and yeah, you'd think eqiad would be best from my house, but codfw is reliably lower-latency, go figure) [14:51:57] (when it's online, anyway) [14:52:26] so far I think it is still online [14:54:01] (03CR) 10Jbond: [C: 03+1] dnsdist: add parameters for TLS configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [14:54:47] (03PS2) 10Apakhomov: mathoid: added support egress rules mathoid: deleted _policy_helper.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/597777 [14:55:11] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10jcrespo) a:05Jclark-ctr→03jcrespo Jclark changed the serial port and we now have output on serial console, including post. The other issue was the mac address...
[14:55:18] (03Abandoned) 10Jbond: wmf_auto_reimage: improve fingerprint detection [puppet] - 10https://gerrit.wikimedia.org/r/515051 (owner: 10Jbond) [14:56:08] (03PS1) 10Bstorm: labstore: turn off systemd paging for labstore1004/5 [puppet] - 10https://gerrit.wikimedia.org/r/601753 [14:56:46] (03CR) 10EBernhardson: query_service: Move shared config into common file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599145 (owner: 10EBernhardson) [14:56:58] !log depref ulsfo-codfw link - T254216 [14:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:02] T254216: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 [14:57:09] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [14:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:19] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1140.eqiad.wmnet'] ` and were **ALL** successful. [14:58:13] 10Operations, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10fgiunchedi) I've PoC this with `check_ipmi_sensor` which supports checking SEL, for example: ` /usr/local/lib/nagios/plugins/check_ipmi_sensor -f freei... 
[15:00:56] RECOVERY - Host thumbor1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:01:10] RECOVERY - Host thumbor1004 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:01:46] RECOVERY - Host thumbor1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [15:01:58] (03PS2) 10Apakhomov: chromium-render: added support egress rules chromium-render: Created symlink _helpers.tpl from common templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/597785 [15:02:22] RECOVERY - Host thumbor1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [15:03:15] (03PS1) 10Bstorm: cloudstore: turn off systemd paging for cloudstore1008/9 [puppet] - 10https://gerrit.wikimedia.org/r/601756 [15:03:37] (03PS1) 10Volans: scripts: assign DNS name for mgmt address [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601757 [15:04:46] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:05] !log shifting all high traffic cpjobqueue rules to k8s [15:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:10] PROBLEM - Check whether ferm is active by checking the default input chain on thumbor1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:05:24] PROBLEM - Check systemd state on thumbor1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:28] PROBLEM - Host wtp1032 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:37] !log hnowlan@deploy1001 Started deploy [cpjobqueue/deploy@8a53ff1]: (no justification provided) [15:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:56] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:07:04] (03PS3) 10Ssingh: dnsdist: add parameters for TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) [15:07:26] !log reboot cr1-codfw:fpc5 - T254216 [15:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:29] T254216: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 [15:08:25] (03PS2) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [15:08:25] restarting ferm on thumbor100[34] [15:08:26] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) Partitions on `contint1001` have chang... 
[15:09:10] !log hnowlan@deploy1001 Finished deploy [cpjobqueue/deploy@8a53ff1]: (no justification provided) (duration: 02m 33s) [15:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:46] (03CR) 10jerkins-bot: [V: 04-1] Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [15:10:12] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:21] (03CR) 10Jdlrobson: [C: 04-1] Use AddFooterLink hook for code of conduct and contact links (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [15:10:50] RECOVERY - Check systemd state on thumbor1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:59] net issues on codfw? lots of hosts went down (soft) [15:11:17] jynus: https://phabricator.wikimedia.org/T254216 [15:11:23] ah, I see [15:11:25] (03CR) 10Ssingh: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:11:28] it was 1.5 hours ago [15:11:39] or more [15:11:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:42] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:11:46] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:11:57] ^but that is new? 
[15:12:16] (03PS1) 10Hashar: contint: move Docker data to /srv/docker [puppet] - 10https://gerrit.wikimedia.org/r/601760 (https://phabricator.wikimedia.org/T224591) [15:12:26] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:37] might be expected, XioNoX is actively working on it [15:12:39] ah, I see [15:12:42] the logs [15:12:44] sorry [15:13:02] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 18.17 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:13:12] hosts should not go down [15:13:21] ok, so that is strange inded [15:13:27] (03PS3) 10Apakhomov: mediawiki-dev: added support egress rules mediawiki-dev: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values. [deployment-charts] - 10https://gerrit.wikimedia.org/r/597787 [15:13:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:13:32] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:13:34] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:13:49] a glitch for things migrating to the other router? 
[15:14:08] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:14:10] everything should have been offloaded from cr1-codfw so in theory no [15:14:14] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:14:30] RECOVERY - Check whether ferm is active by checking the default input chain on thumbor1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:14:52] everything is coming up, even the down ports so far [15:14:55] I will let you work, thinkgs came up [15:15:16] (03PS1) 10Kormat: install_server: Allow reuse of partitions during reimage. [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T251768) [15:15:33] thx for checking! [15:16:00] (03PS2) 10Kormat: install_server: Allow reuse of partitions during reimage. [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [15:17:17] (03PS3) 10Kormat: install_server: Allow reuse of partitions during reimage. [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [15:17:42] (03PS3) 10Apakhomov: parsoid: added support egress rules parsoid: Created symlink _helpers.tpl from common templates. Fixed appbase_url_port field in values. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/597789 [15:19:54] !log rollback ospf changes - T254216 [15:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:58] T254216: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 [15:20:23] (03CR) 10Hashar: "There is no Docker images on contint1001 yet, and even if we had, we can afford to delete them since they are always published to the regi" [puppet] - 10https://gerrit.wikimedia.org/r/601760 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:20:28] (03PS4) 10Kormat: install_server: Allow reuse of partitions during reimage. [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [15:21:39] 10Operations, 10netops: cr1-codfw:fpc5 partial failure - https://phabricator.wikimedia.org/T254216 (10ayounsi) 05Open→03Resolved FPC reboot solved the issue. Will re-open if it re-appears. [15:21:39] thumbor100[34] are back and alerts are cleared (except for T215411) so I'm going to repool [15:21:40] T215411: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 [15:21:46] (03PS1) 10Ayounsi: Revert "Depool codfw for network work" [dns] - 10https://gerrit.wikimedia.org/r/601766 [15:22:04] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.2625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:22:45] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool codfw for network work" [dns] - 10https://gerrit.wikimedia.org/r/601766 (owner: 10Ayounsi) [15:23:19] !log repool codfw - T254216 [15:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:55] (03CR) 10CRusnov: "LGTM."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601757 (owner: 10Volans) [15:24:35] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor100[34].* [15:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:38] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 9.204 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:26:32] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+promet [15:26:32] ter=logging-eqiad&var-topic=All&var-consumer_group=All [15:29:06] icinga-wm: errr message too long! [15:29:14] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 8.808 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:29:44] sigh, I'll take a look [15:31:05] hnowlan: FYI I'm trying to confirm whether the last cpjobqueue deployment is spamming the logs or sth related [15:31:18] godog: it most likely is :( [15:31:20] I'll fix now [15:31:35] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:37] ah, did you find the log spam already ? 
[15:31:41] hnowlan: ^ [15:31:59] godog: it's been in a verbose logging state for a little bit to debug and I didn't turn that down when I did my deploy [15:32:03] so it's almost certainly me [15:32:08] deploying a fix now [15:33:08] ah yeah lotsa "Event was deduplicated based on sha1" [15:33:38] thanks hnowlan ! [15:34:01] (03PS1) 10Hnowlan: cpjobqueue: reduce log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/601776 [15:34:29] godog: the spam should be on the way down [15:34:40] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 8.458 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:35:08] (03CR) 10Dzahn: "@Ssingh There is also modules/wmflib/lib/puppet/parser/functions/ssl_ciphersuite.rb which takes an argument of "strong, mid, compat" and r" [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:35:16] it indeed is hnowlan [15:35:23] sorry about that! [15:35:23] !log power cycling wtp1032 which is bootlooping? 
https://phabricator.wikimedia.org/P11364 [15:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:54] hnowlan: no worries, backlog on kafka is recovery already [15:37:11] !log cdanis@cumin1001 conftool action : set/pooled=no; selector: name=wtp1032.* [15:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:24] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6461/IPv4: Active - Zayo, AS6461/IPv6: Active - Zayo https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:36] (03CR) 10Dzahn: [C: 03+2] "going ahead since it only influences the server 1001 currently not used" [puppet] - 10https://gerrit.wikimedia.org/r/601760 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:40:34] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add additional mtail args support and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601436 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [15:40:43] 10Operations, 10ops-eqiad, 10serviceops-radar: wtp1032 bootlooping on CPU error - https://phabricator.wikimedia.org/T254258 (10CDanis) [15:41:06] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add additional flags support for atsmtail and disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601440 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [15:41:12] (03CR) 10Filippo Giunchedi: [C: 03+1] varnish: add support for additional mtail args and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [15:41:29] (03PS3) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [15:42:56] (03CR) 10jerkins-bot: [V: 04-1] Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 
(https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [15:43:42] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.05417 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:44:34] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:45:06] cdanis: update: there was about 1m15s of delay before sshing via *any* bastion, and it went away when I rebooted my laptop, so I'm going to diagnose this as "computers, am I right?" and move on with my life [15:45:54] !log contint1001 - restarting docker afer changed data-root path (T224591) [15:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:58] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [15:46:45] rzl: 🤔 [15:46:54] rzl: there is a lot of "computers, am I right?" 
today [15:47:02] (03CR) 10Ssingh: "> @Ssingh There is also modules/wmflib/lib/puppet/parser/functions/ssl_ciphersuite.rb" [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:48:24] !log contint1001 - rm -rf /mnt/docker (T224591) [15:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:29] !log push frack fw rules - T254260 [15:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:14] !log thumbor1003 and thumbor1004 blipped, no obvious explanation, logs gathered at P11365 P11366 P11367 [15:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:53:17] (03CR) 10Ssingh: [C: 03+2] dnsdist: add parameters for TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/601727 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:00:05] godog and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:35] (03PS4) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [16:02:19] (03CR) 10jerkins-bot: [V: 04-1] Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [16:07:17] (03PS1) 10Ssingh: dnsdist: update the configuration file template (improves 48144c89) [puppet] - 10https://gerrit.wikimedia.org/r/601782 (https://phabricator.wikimedia.org/T252132) [16:10:42] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/22932/" [puppet] - 10https://gerrit.wikimedia.org/r/601782 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:11:36] (03CR) 10Ssingh: [C: 03+2] dnsdist: update the configuration file template (improves 48144c89) [puppet] - 10https://gerrit.wikimedia.org/r/601782 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:22:30] looks like i broke puppet run on icinga by merging "add data types to monitoring::service" [16:22:39] which means the decom'ed hosts from earlier are not removed yet.. hrmm [16:23:30] parameter 'host' expects a Stdlib::Host which is normally ok.. but there are of course special cases ..like: [16:23:34] ncredir-lb.codfw.wikimedia.org_ncredir_v6 [16:23:54] the underscore part recently added to avoid duplicate definitions i think .. 
now causing this [16:23:57] (03CR) 10Volans: [C: 03+2] scripts: assign DNS name for mgmt address [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601757 (owner: 10Volans) [16:26:08] (03CR) 10Dzahn: monitoring: add data types to monitoring::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [16:26:51] 10Operations, 10DC-Ops, 10cloud-services-team (Hardware): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) [16:28:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs vms: stop using ns1 for resolving [puppet] - 10https://gerrit.wikimedia.org/r/601714 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [16:28:57] (03PS1) 10Dzahn: monitoring: allow "host" to also be a string [puppet] - 10https://gerrit.wikimedia.org/r/601789 [16:30:15] (03CR) 10Dzahn: [C: 03+2] monitoring: allow "host" to also be a string [puppet] - 10https://gerrit.wikimedia.org/r/601789 (owner: 10Dzahn) [16:33:10] ACKNOWLEDGEMENT - Host wtp1032 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T254258 [16:34:04] puppet on icinga fixed with follow-up above (for now) [16:34:18] decom'ed hosts disappearing now [16:40:00] (03CR) 10Cwhite: [C: 03+2] profile: add additional mtail args support and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601436 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [16:54:53] (03PS1) 10Ssingh: dnsdist: add a parameter for setting addDOHLocal's base URL [puppet] - 10https://gerrit.wikimedia.org/r/601796 (https://phabricator.wikimedia.org/T252132) [16:55:49] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: add a parameter for setting addDOHLocal's base URL [puppet] - 10https://gerrit.wikimedia.org/r/601796 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:56:06] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.037 second response time 
on 10.192.48.56 port 9042 https://phabricator.wikimedia.org/T93886 [16:57:48] (03PS2) 10Ssingh: dnsdist: add a parameter for setting addDOHLocal's base URL [puppet] - 10https://gerrit.wikimedia.org/r/601796 (https://phabricator.wikimedia.org/T252132) [16:59:20] (03PS1) 10Hnowlan: changeprop-jobqueue: enable all remaining jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/601798 (https://phabricator.wikimedia.org/T220399) [17:00:04] halfak and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1700). [17:00:19] ACKNOWLEDGEMENT - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active - Zayo, AS6461/IPv4: Active - Zayo Ayounsi TTN-0004129242 - The acknowledgement expires at: 2020-06-03 16:59:55. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:16] (03CR) 10Ssingh: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22933/" [puppet] - 10https://gerrit.wikimedia.org/r/601796 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:03:04] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: enable all remaining jobs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/601798 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:07:13] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [17:19:09] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) >>! In the **task description,** @joe wrote: > I see the following possibilities […]: > >... 
[17:20:21] !log 1.35.0-wmf.35 was branched at 8d7015037d44c4fe21eee8e8a040720b838bc169 for T253023 [17:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:27] T253023: 1.35.0-wmf.35 deployment blockers - https://phabricator.wikimedia.org/T253023 [17:21:20] 10Operations, 10Core Platform Team, 10Performance-Team, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Krinkle) [17:21:23] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [17:26:47] 10Operations, 10Core Platform Team, 10Performance-Team, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Krinkle) Tentatively adding T212129 as sub task, but I think this task is trying to be two things at once, one of which is likely intended. 1. (Task title... [17:33:12] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:34:54] (03CR) 10EBernhardson: Role for SDoC WDQS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [17:37:48] (03PS1) 10Reedy: Add wikimedia server_alias for api.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/601808 (https://phabricator.wikimedia.org/T254185) [17:55:07] (03CR) 10Ppchelko: [C: 03+1] cpjobqueue: reduce log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/601776 (owner: 10Hnowlan) [17:57:57] cdanis: Hey, I think you deploying earlier has led to me being unable to run scap clean somehow; I'm getting permission errors on /srv/mediawiki-staging/php-1.35.0-wmf.31/.git/objects/f2/106f4c399095ea62a064eb3a1a7bb90f189148 which is owned by you (on deploy1001). 
[17:58:06] (03PS5) 10Muehlenhoff: Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) [17:58:12] James_F: looking [17:59:06] (03CR) 10jerkins-bot: [V: 04-1] Ship the sysusers default config via systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/601743 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [17:59:15] It's still in the wikidev group though? [17:59:27] there's a lot of stuff that isn't group writable, don't know why [17:59:41] Oh, right, a+r no g+w. [17:59:45] That'd not help. [18:00:00] Hmm, my new branch also has that (but owned by me, so I guess I'd not notice). [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1800) [18:00:08] yeah, I'm not sure what happened there... [18:00:09] Did we change how permissions work somehow? [18:00:17] I don't know! I'm curious how this is 'supposed' to work [18:00:29] I didn't do anything special, I just did a 'git revert' on a security patch in each branch earlier today. [18:00:35] Yeah. [18:00:47] How do we even get git to have permissions for the .git dir? [18:01:53] James_F: I just made all the subdirs of .git/objects g+w, maybe that fixes it? [18:02:13] cdanis: It fixed it for now, yes, thanks! [18:02:22] But we should work out how that happened. [18:02:44] Relying on a root to do weekly deploy cleanup would be tedious. [18:03:05] not sure if my umask is different or what [18:03:08] Unless there's some magic "roots shouldn't deploy" implicit rule ? [18:03:10] I'll do the same in the other branches I touched [18:03:14] Thanks. 
[18:03:16] (03PS1) 10Reedy: Update composer lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601815 [18:05:19] !log fixing g+w permissions of deploy1001 /srv/mediawiki-staging/php-*/.git/objects/* [18:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:32] cdanis: Thank you so much. [18:13:21] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10wiki_willy) Thanks @jcrespo , our documentation looks to be a bit outdated, so we'll get this added in >>! In T250602#6185325, @jcrespo wrote: > Jclark changed the serial port and we... [18:16:00] (03PS1) 10Reedy: [beta] add apiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) [18:16:14] Reedy: apiwiki? Really? [18:16:32] Let's call it mediawikiapidocforkwiki. ;-) [18:16:35] lol [18:16:58] You go tell CPT the domain is going to be mediawikiapidocfor.wikimedai.org [18:17:01] * Reedy waits [18:17:06] mediawikipublicapidocumentationportalwiki? [18:17:15] The mapping of DB to domain is not automatic. [18:17:20] As well you know. :-P [18:17:29] It's easier when it makes sense [18:17:34] Wikimedai.org is for our AI off-shoot, right? [18:17:46] The articles write themselves [18:17:54] do we really need a new wiki for that? [18:18:03] can't we reuse one existing already? [18:18:05] We're going to have a beta one and a production one [18:18:08] hauskatze: Feel free to suggest to CPT that we shouldn't. [18:18:15] hauskatze: Above my paygrade [18:18:20] It's not our call. [18:18:24] James_F: do I note a hint of irony? :) [18:18:32] Nope. We're completely serious [18:18:39] Indeed. [18:18:47] We're doing it because we're being asked/told to do so [18:19:02] While we can make some tweaks... Yeah [18:19:17] Creating a wiki just for an API looks weird to me [18:19:17] Reedy: Can we at least call it wikimediaapiwiki? [18:19:39] Reedy: "api" is a valid language code. 
[18:20:15] I did wonder about that [18:20:21] But multi tasking, so didn't get round to checking [18:20:22] Not currently assigned, but… [18:20:27] (03CR) 10Majavah: [beta] add apiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) (owner: 10Reedy) [18:21:12] apa is the code for the Apache macrolanguage cluster. [18:21:24] So it'd be plausible for one of the Apache languages to be granted api in the future. [18:22:04] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.31 (duration: 19m 59s) [18:22:06] 10Operations, 10Wikimedia-Logstash, 10observability: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) >>! In T247376#6184607, @Dzahn wrote: > logtash2028 is reporting as failed SSH since 2 days. There is noting in SAL or an open ticket. Notifications are disable... [18:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:22] We should probably try and the prevent the clash now, indeed [18:22:22] Hmm, I thought wtp1032.eqiad.wmnet was depooled? It's still in the dsh list. [18:22:45] We really should move all the wikipedias to foowikipedia [18:22:53] But that's never going to happen. [18:23:10] heh [18:23:18] I'll leave a comment on the beta and prod tasks [18:23:24] Ack. [18:23:48] 10Operations, 10ops-eqiad, 10serviceops-radar: wtp1032 bootlooping on CPU error - https://phabricator.wikimedia.org/T254258 (10wiki_willy) a:03Cmjohnson @Cmjohnson - looks like the warranty on this one just ended a few months ago, so just let me know whatever you find during troubleshooting, and we can ord... 
[18:24:21] (03PS1) 10Jforrester: testwikis wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601820 [18:24:23] (03CR) 10Jforrester: [C: 03+2] testwikis wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601820 (owner: 10Jforrester) [18:25:08] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601820 (owner: 10Jforrester) [18:25:17] !log jforrester@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.35 [18:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:19] 10Operations, 10ops-eqiad, 10serviceops-radar: wtp1032 bootlooping on CPU error - https://phabricator.wikimedia.org/T254258 (10Jdforrester-WMF) Machine seems to still be in the `dsh` group; can this be fixed? [18:27:32] 10Operations, 10ops-eqiad, 10Analytics: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) There is a larger issue with this server, replaced the CPU but noticed the power supplies are both failed. I could also smell burning in the server, swapped the power supplies with decom spar... [18:33:59] James_F: lol, I note amir has used apiwikimedia as the suggestion for the prod wiki... [18:34:05] But wikimedia is only used for chapter wikis currently [18:34:10] so that's more confustion [18:34:16] and some confusion too [18:34:22] Yes, definitely not that. 
[18:35:24] (03PS16) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [18:46:36] (03PS1) 10Andrew Bogott: cloud puppet api: add a route to get a list of projects [puppet] - 10https://gerrit.wikimedia.org/r/601825 (https://phabricator.wikimedia.org/T252224) [18:50:20] PROBLEM - Host wtp1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:52:36] RECOVERY - Host wtp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:55:01] 10Operations, 10ops-eqiad, 10serviceops-radar: wtp1032 bootlooping on CPU error - https://phabricator.wikimedia.org/T254258 (10Cmjohnson) 05Open→03Resolved the server is out of warranty, I reseated both CPUs and cleared the system event log. The server booted okay. I will resolve this for now, please o... [18:55:20] (03CR) 10Cwhite: [C: 03+2] profile: add additional flags support for atsmtail and disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601440 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [18:55:28] (03PS3) 10Cwhite: profile: add additional flags support for atsmtail and disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601440 (https://phabricator.wikimedia.org/T254192) [18:56:10] RECOVERY - Host wtp1032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:56:43] (03CR) 10Andrew Bogott: [C: 03+2] cloud puppet api: add a route to get a list of projects [puppet] - 10https://gerrit.wikimedia.org/r/601825 (https://phabricator.wikimedia.org/T252224) (owner: 10Andrew Bogott) [18:58:38] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 95 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:00:05] James_F and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T1900). [19:00:27] (Still in the endless testwiki deployment of doom.) 
[19:01:53] ah [19:09:44] Woo-hoo, we reached 50% of the sync-apaches. Maybe another half hour to go? [19:18:54] PROBLEM - MariaDB Slave Lag: s1 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1128.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:27:00] 10Operations, 10Wikimedia-Logstash, 10observability: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10Dzahn) Ah, thanks Herron! [19:27:15] Now onto the cdb rebuild. [19:48:16] (03PS1) 10Herron: centrallog: split syslogs into host directories [puppet] - 10https://gerrit.wikimedia.org/r/601836 [19:49:21] (03PS2) 10Herron: centrallog: split syslogs into host directories [puppet] - 10https://gerrit.wikimedia.org/r/601836 [19:51:40] This is ludicrous. [19:52:08] (03PS1) 10Legoktm: Drop unused static-web-sssd image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601838 [19:52:10] (03PS1) 10Legoktm: Add html web image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817) [19:53:05] (03PS8) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [19:53:42] James_F: still rebuilding? [19:54:30] (03CR) 10Herron: [C: 03+1] icinga: delete unreferenced contact groups [puppet] - 10https://gerrit.wikimedia.org/r/601672 (https://phabricator.wikimedia.org/T254006) (owner: 10Filippo Giunchedi) [19:54:57] longma: Yup. 
:-( [19:55:07] (03CR) 10Jdlrobson: [C: 04-1] Use AddFooterLink hook for code of conduct and contact links (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [19:56:29] PROBLEM - PHP opcache health on wtp1032 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:56:44] 19:56:35 Finished scap-cdb-rebuild (duration: 29m 33s) [19:56:47] * James_F sighs heavily. [19:56:53] woohoo [19:57:22] OK, so just an hour late. [19:59:09] !log jforrester@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.35 (duration: 93m 52s) [19:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:15] 19:59:09 Finished scap: testwikis wikis to 1.35.0-wmf.35 (duration: 93m 52s) [19:59:33] All looks fine; proceeding immediately to group0. [19:59:42] (03PS1) 10Jforrester: group0 wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601841 [19:59:44] (03CR) 10Jforrester: [C: 03+2] group0 wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601841 (owner: 10Jforrester) [20:00:28] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601841 (owner: 10Jforrester) [20:01:00] (03PS1) 10Jdlrobson: Disable growth survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601842 (https://phabricator.wikimedia.org/T251741) [20:01:11] (03CR) 10Herron: [C: 03+1] varnish: add support for additional mtail args and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [20:02:06] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.35 [20:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:09] (03CR) 10Herron: [C: 
03+1] hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [20:03:27] longma: Things look OK to me; agreed? [20:03:42] yeah looks good here [20:05:12] (03PS1) 10Jdlrobson: Enable talk pages on Swedish Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) [20:05:53] OK, declaring train departed. [20:06:07] Reedy: Want to go Beta wiki creating? Be my guest. ;-) [20:06:26] I'm busy for the next hour maybe ;P [20:06:33] Need to finish bike shedding on the dbname ;) [20:06:41] * James_F grins. [20:07:15] (I note, that's not busy for the next hour bike shedding on the dbname name) [20:07:59] (03CR) 10jerkins-bot: [V: 04-1] Enable talk pages on Swedish Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) (owner: 10Jdlrobson) [20:16:05] (03PS1) 10Jdlrobson: Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) [20:17:51] (03PS1) 10Bearloga: profile::analytics::cluster::packages::common: Add libfontconfig1-dev [puppet] - 10https://gerrit.wikimedia.org/r/601848 (https://phabricator.wikimedia.org/T254278) [20:18:13] apiportalwiki seems to work a bit better than wikimediaapi [20:18:20] as you'd expect it to start with a not w [20:18:31] (03CR) 10jerkins-bot: [V: 04-1] Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [20:22:25] (03CR) 10Reedy: [C: 03+2] Update composer lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601815 (owner: 10Reedy) [20:22:57] Reedy: Let's go with that, then. 
[20:23:14] (03Merged) 10jenkins-bot: Update composer lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601815 (owner: 10Reedy) [20:24:36] (03PS2) 10Jdlrobson: Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) [20:24:46] !log reedy@deploy1001 Synchronized composer.lock: Update (duration: 01m 06s) [20:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:19] (03PS2) 10Jdlrobson: Enable talk pages on Swedish Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601843 (https://phabricator.wikimedia.org/T253985) [20:29:32] (03PS2) 10Reedy: [beta] add apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) [20:29:39] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (10Krinkle) [20:30:26] (03CR) 10Jforrester: [C: 03+1] [beta] add apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) (owner: 10Reedy) [20:31:24] (03CR) 10Krinkle: [C: 03+1] [beta] add apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) (owner: 10Reedy) [20:34:21] ottomata: quick question, does this look right for declaring wiki-specific streams in wgEventStreams? https://www.mediawiki.org/wiki/Wikimedia_Product/Analytics_Infrastructure/Stream_configuration#Stream_cc-ing [20:34:56] bearloga: No, it's not valid php :P [20:35:15] 'default' = [ needs to become 'default' => [ [20:35:18] etc [20:35:36] Reedy: ah, whoops, thanks! [20:36:00] You're also missing some trailing commas [20:36:24] After schema_title where you've got sampling too [20:36:28] bearloga: heh aside from ^ ithink that isright ya! 
[20:36:58] Is it really nested arrays? [20:37:50] thanks! and yup [20:38:11] foreach ( $streamConfigsArray as $streamConfig ) { [20:38:11] $this->streamConfigEntries[] = new StreamConfig( $streamConfig ); [20:38:11] } [20:38:20] Just looked a bit odd in the config :) [21:01:12] RECOVERY - PHP opcache health on wtp1032 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:06:53] (03PS3) 10Reedy: [beta] add apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) [21:12:18] (03CR) 10Reedy: [C: 03+2] [beta] add apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) (owner: 10Reedy) [21:12:33] !log repooled wtp1032 T254258 [21:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:37] T254258: wtp1032 bootlooping on CPU error - https://phabricator.wikimedia.org/T254258 [21:13:31] (03Merged) 10jenkins-bot: [beta] add apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601817 (https://phabricator.wikimedia.org/T254185) (owner: 10Reedy) [21:15:45] zomg, addwiki wasn't broken [21:16:05] for the first time [21:17:24] PROBLEM - PHP opcache health on wtp1032 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:17:53] !log reedy@deploy1001 Synchronized dblists/all-labs.dblist: beta apiportalwiki T254185 (duration: 01m 06s) [21:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:58] T254185: Create new wiki on the beta cluster to test API Portal - https://phabricator.wikimedia.org/T254185 [21:19:11] !log reedy@deploy1001 Synchronized wikiversions-labs.json: beta apiportalwiki T254185 (duration: 01m 05s) [21:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:36] !log 
reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta apiportalwiki T254185 (duration: 01m 05s) [21:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:54] !log reedy@deploy1001 Synchronized wmf-config/config/apiportalwiki.yaml: beta apiportalwiki T254185 (duration: 01m 05s) [21:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:53] !log reedy@deploy1001 Synchronized multiversion/MWMultiVersion.php: beta apiportalwiki T254185 (duration: 01m 06s) [21:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:56] T254185: Create new wiki on the beta cluster to test API Portal - https://phabricator.wikimedia.org/T254185 [21:24:38] RECOVERY - PHP opcache health on wtp1032 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:24:52] (03PS3) 10Jdlrobson: Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) [21:26:02] (03PS3) 10Cwhite: varnish: add support for additional mtail args and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) [21:29:13] (03CR) 10Cwhite: [C: 03+2] varnish: add support for additional mtail args and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [21:39:31] (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601867 [21:39:33] (03CR) 10Reedy: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601867 (owner: 10Reedy) [21:39:53] (03Abandoned) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601867 (owner: 10Reedy) [21:41:15] (03PS17) 10EBernhardson: Role for SDoC WDQS [puppet] - 
10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [21:41:52] (03PS1) 10Reedy: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601868 [21:41:54] (03CR) 10Reedy: [C: 03+2] Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601868 (owner: 10Reedy) [21:42:19] (03Abandoned) 10Reedy: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601868 (owner: 10Reedy) [21:42:40] (03PS1) 10Reedy: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601870 [21:42:42] (03CR) 10Reedy: [C: 03+2] Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601870 (owner: 10Reedy) [21:42:58] (03Abandoned) 10Reedy: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601870 (owner: 10Reedy) [21:43:29] (03CR) 10Legoktm: "Ping, can I get a +2?" 
[puppet] - 10https://gerrit.wikimedia.org/r/568857 (owner: 10Legoktm) [21:43:45] (03PS1) 10Alex Monk: acme_chief: Don't bother with cert-sync timer if we have no passive host to send to [puppet] - 10https://gerrit.wikimedia.org/r/601871 [21:44:42] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Don't bother with cert-sync timer if we have no passive host to send to [puppet] - 10https://gerrit.wikimedia.org/r/601871 (owner: 10Alex Monk) [21:45:30] (03PS2) 10Alex Monk: acme_chief: Don't bother with cert-sync timer if we have no passive host [puppet] - 10https://gerrit.wikimedia.org/r/601871 [21:45:59] (03PS1) 10Reedy: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601872 [21:46:01] (03CR) 10Reedy: [C: 03+2] Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601872 (owner: 10Reedy) [21:46:28] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Don't bother with cert-sync timer if we have no passive host [puppet] - 10https://gerrit.wikimedia.org/r/601871 (owner: 10Alex Monk) [21:46:48] (03Merged) 10jenkins-bot: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601872 (owner: 10Reedy) [21:48:18] !log reedy@deploy1001 Synchronized wmf-config/interwiki-labs.php: laaaaabs (duration: 01m 05s) [21:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:44] (03PS1) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw [puppet] - 10https://gerrit.wikimedia.org/r/601874 (https://phabricator.wikimedia.org/T251466) [21:50:38] (03PS3) 10Alex Monk: acme_chief: Don't bother with cert-sync timer if we have no passive host [puppet] - 10https://gerrit.wikimedia.org/r/601871 [21:52:53] (03PS2) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) [21:55:02] (03PS1) 10Volans: scripts: add support 
for primary IP generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) [21:59:02] (03CR) 10Cwhite: "LGTM modulo inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601836 (owner: 10Herron) [21:59:27] (03CR) 10Volans: "The script is testable on af-netbox.wmflabs.org, where the following DCs are "activated":" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601877 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [22:03:16] (03PS1) 10Ryan Kemper: maintenance::cirrussearch: extract to file [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:05:57] (03PS2) 10Ryan Kemper: maintenance::cirrussearch: extract to file [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:09:05] (03PS3) 10Ryan Kemper: maintenance::cirrussearch: extract to file [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:16:12] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:17:52] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:26:29] (03PS4) 10Ryan Kemper: maintenance::cirrussearch: extract index rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:27:31] (03CR) 10Gergő Tisza: [C: 03+1] maintenance::cirrussearch: extract index rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [22:28:38] (03PS4) 10Jdlrobson: Stop special casing the main page on several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601847 (https://phabricator.wikimedia.org/T32405) 
[22:32:28] (03PS1) 10CRusnov: netbox: Configure for netbox-dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/601893 (https://phabricator.wikimedia.org/T253140) [22:32:56] (03PS5) 10Ryan Kemper: maintenance::cirrussearch: extract index rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:33:30] (03CR) 10Gergő Tisza: maintenance::cirrussearch: extract index rebuild (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [22:33:38] (03CR) 10EBernhardson: maintenance::cirrussearch: extract index rebuild (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [22:35:12] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [22:36:01] (03PS6) 10Ryan Kemper: maintenance::cirrussearch: extract index rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:36:03] (03CR) 10Gergő Tisza: maintenance::cirrussearch: extract index rebuild (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [22:37:21] 10Operations, 10ops-eqiad, 10DC-Ops: Update Documentation for dl360 Motherboard Swap - https://phabricator.wikimedia.org/T254272 (10wiki_willy) [22:37:23] (03PS1) 10Andrew Bogott: wmcs-novastats-puppetleaks: clean up prefixes for delete projects [puppet] - 10https://gerrit.wikimedia.org/r/601896 (https://phabricator.wikimedia.org/T252224) [22:38:16] (03CR) 10jerkins-bot: [V: 04-1] wmcs-novastats-puppetleaks: clean up prefixes for delete projects [puppet] - 10https://gerrit.wikimedia.org/r/601896 (https://phabricator.wikimedia.org/T252224) (owner: 10Andrew Bogott) [22:38:24] (03PS7) 10Ryan Kemper: maintenance::cirrussearch: extract index 
rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:39:49] (03PS2) 10Andrew Bogott: wmcs-novastats-puppetleaks: clean up prefixes for delete projects [puppet] - 10https://gerrit.wikimedia.org/r/601896 (https://phabricator.wikimedia.org/T252224) [22:40:30] (03CR) 10Ryan Kemper: "> Patch Set 4:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [22:42:35] (03PS8) 10Ryan Kemper: maintenance::cirrussearch: extract index rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) [22:43:48] 10Operations, 10DC-Ops, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10wiki_willy) codfw Cisco servers were returned last quarter via Cisco's takeback program: https://www.cisco.com/c/en/us/about/takeback-and-reuse/takeback-recycle-program.html... [22:48:03] 10Operations, 10DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (10wiki_willy) Demo for Dell's System Management Tool set up for next Monday on June 8, to evaluate if it's something we want to use going forward or if it's something the Inf... [22:56:22] (03CR) 10Bstorm: [C: 03+2] toolforge: Run clush helpers with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/567192 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [22:57:00] (03PS2) 10Bstorm: toolforge: Run clush helpers with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/567192 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [22:59:02] (03CR) 10Bstorm: [C: 03+2] toolforge: Run clush helpers with Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/567192 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. 
It just develops random features. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200602T2300). [23:00:05] Jdlrobson: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:24] and im here [23:03:59] (03PS18) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [23:04:34] anyone around to help me swat some things? [23:07:56] (03CR) 10Ryan Kemper: "https://puppet-compiler.wmflabs.org/compiler1003/22940/mwmaint1002.eqiad.wmnet/index.html pcc run" [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [23:09:20] (03CR) 10EBernhardson: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [23:21:22] (03CR) 10Ryan Kemper: [C: 03+2] maintenance::cirrussearch: extract index rebuild [puppet] - 10https://gerrit.wikimedia.org/r/601882 (https://phabricator.wikimedia.org/T253114) (owner: 10Ryan Kemper) [23:35:55] (03CR) 10Andrew Bogott: [C: 03+1] "This seems good, the double pages don't shed any additional light." [puppet] - 10https://gerrit.wikimedia.org/r/601753 (owner: 10Bstorm) [23:37:38] (03CR) 10Andrew Bogott: "I support removing the systemd pages. I'm not clear on what the other two changes do, if anything, but ignoring disk failures seems bad." [puppet] - 10https://gerrit.wikimedia.org/r/601374 (owner: 10Bstorm)