[00:00:04] twentyafterfour: #bothumor I ♥ Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200618T0000). [00:01:46] (03CR) 10Dzahn: [C: 04-1] "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [00:09:44] (03PS1) 10Mstyles: sdoc gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [03:46:01] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, 10WMF-Legal: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10jrbs) [03:50:50] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:51:06] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:58:02] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:58:18] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:41:23] 10Operations, 10WMF-Design, 10Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Prtksxna) >>! In T254118#6233578, @Iniquity wrote: >>>! In T254118#6182683, @Iniquity wrote: >> No matter, I missed. Sorry :) > Oh, I was not mista... 
[04:50:48] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1136', diff saved to https://phabricator.wikimedia.org/P11573 and previous config saved to /var/cache/conftool/dbconfig/20200618-045047-marostegui.json [04:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:10] (03PS1) 10Marostegui: db2091: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606311 (https://phabricator.wikimedia.org/T253217) [04:52:51] (03CR) 10Marostegui: [C: 03+2] db2091: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606311 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [04:56:17] (03PS1) 10Marostegui: report_users: Fix dbproxy1020 IP [software] - 10https://gerrit.wikimedia.org/r/606312 [04:56:50] (03CR) 10Marostegui: [C: 03+2] report_users: Fix dbproxy1020 IP [software] - 10https://gerrit.wikimedia.org/r/606312 (owner: 10Marostegui) [06:25:28] (03CR) 10RhinosF1: "The code is fine but I have 2 points:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [06:57:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/606177 (https://phabricator.wikimedia.org/T255665) (owner: 10Jbond) [07:00:04] Deploy window Friday Holiday; no deploys Thursday. See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200618T0700) [07:05:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:07:06] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:10:40] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [07:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:33] (03PS1) 10DCausse: [cirrus] drop icinga check on deprecated cirrus bulk update topic [puppet] - 10https://gerrit.wikimedia.org/r/606372 [07:16:49] (03PS1) 10Elukey: profile::archiva: use /srv/archiva-public for rsync [puppet] - 10https://gerrit.wikimedia.org/r/606373 [07:17:30] (03PS1) 10Majavah: Disable NS_USER(_TALK) search engine indexing on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606374 (https://phabricator.wikimedia.org/T255538) [07:17:56] (03CR) 10jerkins-bot: [V: 04-1] profile::archiva: use /srv/archiva-public for rsync [puppet] - 10https://gerrit.wikimedia.org/r/606373 (owner: 10Elukey) [07:18:18] (03PS2) 10Elukey: profile::archiva: use /srv/archiva-public for rsync [puppet] - 10https://gerrit.wikimedia.org/r/606373 [07:19:25] (03CR) 10jerkins-bot: [V: 04-1] profile::archiva: use /srv/archiva-public for rsync [puppet] - 10https://gerrit.wikimedia.org/r/606373 (owner: 10Elukey) [07:19:42] (03CR) 10RhinosF1: [C: 03+1] Disable NS_USER(_TALK) search engine indexing on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606374 (https://phabricator.wikimedia.org/T255538) (owner: 10Majavah) [07:19:45] (03CR) 10Elukey: [C: 03+2] [cirrus] drop icinga check on deprecated cirrus bulk update topic [puppet] - 10https://gerrit.wikimedia.org/r/606372 (owner: 10DCausse) [07:20:35] (03PS1) 10Ayounsi: Add msw interfaces support [homer/public] - 10https://gerrit.wikimedia.org/r/606375 [07:22:29] (03PS3) 10Elukey: profile::archiva: use /srv/archiva-public for rsync [puppet] - 10https://gerrit.wikimedia.org/r/606373 [07:22:37] !log rolling reboot of ganeti servers in codfw [07:22:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:23] (03PS1) 10Volans: scripts: filter only active IPAddress objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/606376 [07:24:01] (03CR) 10Ayounsi: [C: 03+1] "lgtm! thx" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/606376 (owner: 10Volans) [07:24:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [07:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:33] (03CR) 10Volans: [C: 03+2] scripts: filter only active IPAddress objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/606376 (owner: 10Volans) [07:25:05] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23316/" [puppet] - 10https://gerrit.wikimedia.org/r/606373 (owner: 10Elukey) [07:25:44] !log ayounsi@cumin2001 START - Cookbook sre.dns.netbox [07:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:22] !log ayounsi@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:15] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1136', diff saved to https://phabricator.wikimedia.org/P11574 and previous config saved to /var/cache/conftool/dbconfig/20200618-073414-marostegui.json [07:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [07:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:01] (03PS1) 10Elukey: Revert "profile::archiva: use /srv/archiva-public for rsync" [puppet] - 10https://gerrit.wikimedia.org/r/606380 [07:37:23] (03CR) 10Elukey: [C: 03+2] Revert "profile::archiva: use 
/srv/archiva-public for rsync" [puppet] - 10https://gerrit.wikimedia.org/r/606380 (owner: 10Elukey) [07:41:45] !log Reimage es1025 [07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:12] (03PS2) 10Andrew-WMDE: TwoColConflict: Talk page small deployment InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605253 (https://phabricator.wikimedia.org/T254458) [07:42:39] (03PS2) 10Andrew-WMDE: TwoColConflict: Talk page small deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) [07:45:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [07:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:49] (03CR) 10Ema: [C: 03+2] ATS: unset Transfer-Encoding on 304 responses from origins [puppet] - 10https://gerrit.wikimedia.org/r/606204 (https://phabricator.wikimedia.org/T255368) (owner: 10Ema) [07:50:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [07:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:19] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) The easy trick is to have the rsync da... 
[07:53:22] (03PS1) 10Ayounsi: Add cr3-eqsin to DNS [dns] - 10https://gerrit.wikimedia.org/r/606385 (https://phabricator.wikimedia.org/T253246) [07:57:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:23] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:35] 10Operations, 10DNS, 10Traffic, 10netbox: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (10ayounsi) p:05Triage→03High [08:04:22] (03PS18) 10DCausse: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [08:04:49] (03PS1) 10Elukey: archiva: allow to set a custom location for the user db [puppet] - 10https://gerrit.wikimedia.org/r/606386 [08:05:12] (03CR) 10Ayounsi: [C: 03+2] Add cr3-eqsin to DNS [dns] - 10https://gerrit.wikimedia.org/r/606385 (https://phabricator.wikimedia.org/T253246) (owner: 10Ayounsi) [08:05:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:05:37] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [08:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:03] (03CR) 10Elukey: [C: 03+2] archiva: allow to set a custom location for the user db [puppet] - 10https://gerrit.wikimedia.org/r/606386 (owner: 10Elukey) [08:07:07] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23317/" [puppet] - 10https://gerrit.wikimedia.org/r/606386 (owner: 10Elukey) [08:07:49] 10Operations, 10DNS, 10Traffic, 10netbox: Netbox DNS change not effective in gdns - 
https://phabricator.wikimedia.org/T255748 (10ayounsi) I then deployed https://gerrit.wikimedia.org/r/c/operations/dns/+/606385 and the `sudo -s authdns-update` fixed instantly: ` bast5001:~$ host cr3-eqsin.mgmt.eqsin.wmnet... [08:08:17] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [08:08:19] (03PS8) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [08:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:17] (03PS19) 10DCausse: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [08:09:28] (03PS9) 10Jbond: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) [08:10:01] (03CR) 10Jbond: "updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [08:10:49] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:26] (03PS1) 10Jcrespo: Revert "gerrit: add parameter for db_name, let gerrit1002 use test db" [puppet] - 10https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) [08:19:29] (03PS1) 10Marostegui: install_server: Do not reimage es1025,db2091 [puppet] - 10https://gerrit.wikimedia.org/r/606388 [08:20:17] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage es1025,db2091 [puppet] - 10https://gerrit.wikimedia.org/r/606388 (owner: 10Marostegui) [08:22:39] (03PS1) 10Marostegui: es1025: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606389 [08:23:16] (03CR) 10Marostegui: [C: 03+2] es1025: Enable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/606389 (owner: 10Marostegui) [08:24:33] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool es1025', diff saved to https://phabricator.wikimedia.org/P11576 and previous config saved to /var/cache/conftool/dbconfig/20200618-082432-marostegui.json [08:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:21] !log change archiva-ci password in archiva [08:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:07] (03CR) 10Jbond: [C: 03+2] profile::idp::client::httpd: remove trailing slash from proxied as [puppet] - 10https://gerrit.wikimedia.org/r/606177 (https://phabricator.wikimedia.org/T255665) (owner: 10Jbond) [08:34:13] (03PS2) 10Ema: purged: restart upon configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/604716 [08:34:26] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/604716 (owner: 10Ema) [08:36:58] (03CR) 10QChris: "I'm obviously fine with reverting." 
[puppet] - 10https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: 10Jcrespo) [08:37:51] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool es1025', diff saved to https://phabricator.wikimedia.org/P11577 and previous config saved to /var/cache/conftool/dbconfig/20200618-083749-marostegui.json [08:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:31] (03CR) 10Vgutierrez: [C: 03+1] purged: restart upon configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/604716 (owner: 10Ema) [08:39:02] (03CR) 10Ema: [C: 03+2] purged: restart upon configuration changes [puppet] - 10https://gerrit.wikimedia.org/r/604716 (owner: 10Ema) [08:40:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:02] (03PS1) 10Marostegui: mariadb: Reimage es2022 [puppet] - 10https://gerrit.wikimedia.org/r/606390 (https://phabricator.wikimedia.org/T250666) [08:42:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage es2022 [puppet] - 10https://gerrit.wikimedia.org/r/606390 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [08:47:06] (03PS8) 10Vgutierrez: ATS: Add http-redirect.lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [08:47:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:21] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool es2022 for reimage', diff saved to https://phabricator.wikimedia.org/P11578 and previous config saved to /var/cache/conftool/dbconfig/20200618-084720-marostegui.json [08:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:14] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add http-redirect.lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 
(https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [08:49:30] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool es1025', diff saved to https://phabricator.wikimedia.org/P11580 and previous config saved to /var/cache/conftool/dbconfig/20200618-084929-marostegui.json [08:49:30] !log ayounsi@cumin2001 START - Cookbook sre.network.prepare-upgrade [08:49:30] !log ayounsi@cumin2001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [08:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:49] !log ayounsi@cumin2001 START - Cookbook sre.network.prepare-upgrade [08:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:30] !log ayounsi@cumin2001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=1) [08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10MoritzMuehlenhoff) >>! In T220853#6232931, @wiki_willy wrote: > @Andrew - just wanted to keep you posted with the latest u... 
[08:54:35] (03PS1) 10Ema: purged: remove unused argument 'stats_dir' [puppet] - 10https://gerrit.wikimedia.org/r/606393 [08:54:54] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:57:17] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606393 (owner: 10Ema) [08:59:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [08:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:28] !log marostegui@cumin2001 dbctl commit (dc=all): 'Fully repool es1025', diff saved to https://phabricator.wikimedia.org/P11581 and previous config saved to /var/cache/conftool/dbconfig/20200618-085927-marostegui.json [08:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:30] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:01:06] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) Actually we do not need any name mappi... 
[09:04:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:06:38] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [09:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [09:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:07] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:51] (03PS1) 10Hashar: ci:master: prepare for contint switchover [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) [09:18:51] (03PS5) 10Filippo Giunchedi: thanos: use object storage for data older than 15d [puppet] - 10https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) [09:20:06] (03CR) 10Filippo Giunchedi: [C: 03+2] "I've tweaked the change to limit object storage to 15d or older only. We can limit Prometheus as well at the end of next quarter (i.e. whe" [puppet] - 10https://gerrit.wikimedia.org/r/605950 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:20:38] 10Operations, 10MediaWiki-General, 10Patch-For-Review, 10Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (10akosiaris) >>! 
In T105378#6233389, @tstarling wrote: > The log messages r... [09:22:37] (03PS1) 10Marostegui: install_server: Do not reimage es2022 [puppet] - 10https://gerrit.wikimedia.org/r/606396 [09:23:26] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage es2022 [puppet] - 10https://gerrit.wikimedia.org/r/606396 (owner: 10Marostegui) [09:24:50] 10Operations, 10CAS-SSO, 10User-jbond: cas-puppetboard fails to show facts - https://phabricator.wikimedia.org/T255665 (10jbond) 05Open→03Resolved a:03jbond This is working now however users may need to clear their session cookies to clear out the bad values [09:25:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [09:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:18] godog: is your puppet merge going fine? the process has been locked for around 4 minutes [09:27:03] (03PS1) 10Alexandros Kosiaris: kubernetes: Set correct rows for nodes [puppet] - 10https://gerrit.wikimedia.org/r/606397 [09:27:37] marostegui: ugh yes, sorry! merging now [09:27:44] thank you [09:28:01] {{done}} [09:28:06] thanks! 
[09:29:29] (03PS2) 10Mvolz: Update citoid to include change Ia5bc189 [deployment-charts] - 10https://gerrit.wikimedia.org/r/604655 [09:29:47] !log temp stop logstash on elk7 to test 8 pipeline workers - T255243 [09:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] T255243: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 [09:31:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes: Set correct rows for nodes [puppet] - 10https://gerrit.wikimedia.org/r/606397 (owner: 10Alexandros Kosiaris) [09:33:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:34:28] !log Deploy schema change on s3 codfw master (this will create lag on codfw) - T250066 [09:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:45] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:46] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 [09:35:00] ooof, that's me on thanos-fe1001 [09:37:07] (03PS1) 10Marostegui: es2022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606399 [09:37:34] (03CR) 10Marostegui: [C: 03+2] es2022: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606399 (owner: 10Marostegui) [09:38:03] !log uncordon kubernetes20{07..14} and kubernetes10{07..14}. 
Nodes are now fully put in rotation and ready to receive production traffic [09:38:04] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool es2022', diff saved to https://phabricator.wikimedia.org/P11582 and previous config saved to /var/cache/conftool/dbconfig/20200618-093803-marostegui.json [09:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:12] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [09:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:27] !log update wikifeeds to latest chart version in codfw [09:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2076', diff saved to https://phabricator.wikimedia.org/P11583 and previous config saved to /var/cache/conftool/dbconfig/20200618-094001-marostegui.json [09:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:58] (03PS1) 10Filippo Giunchedi: thanos: fix min/max time options parsing [puppet] - 10https://gerrit.wikimedia.org/r/606400 (https://phabricator.wikimedia.org/T252186) [09:42:08] (03PS1) 10Kormat: install_server: Use reuse-db.cfg by default for db machines. [puppet] - 10https://gerrit.wikimedia.org/r/606401 (https://phabricator.wikimedia.org/T251768) [09:42:55] (03PS2) 10Kormat: install_server: Use reuse-db.cfg by default for db machines. 
[puppet] - 10https://gerrit.wikimedia.org/r/606401 (https://phabricator.wikimedia.org/T251768) [09:42:59] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: fix min/max time options parsing [puppet] - 10https://gerrit.wikimedia.org/r/606400 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [09:45:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:22] (03PS3) 10Thiemo Kreuz (WMDE): TwoColConflict: Talk page small deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [09:48:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [09:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:47] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "Not needed any more, see T254458#6234250." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605253 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [09:49:16] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] TwoColConflict: Talk page small deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [09:54:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:11] (03PS1) 10Marostegui: wmnet: Add es4 and es5 alias [dns] - 10https://gerrit.wikimedia.org/r/606403 [10:09:24] (03CR) 10Kormat: [C: 03+2] wmnet: Add es4 and es5 alias [dns] - 10https://gerrit.wikimedia.org/r/606403 (owner: 10Marostegui) [10:09:45] (03CR) 10Kormat: [C: 03+1] "Oops, that was over-enthusiastic :)" [dns] - 10https://gerrit.wikimedia.org/r/606403 (owner: 10Marostegui) [10:11:18] RECOVERY - Memcached on idp-test2001 is OK: TCP OK - 0.036 second response 
time on 208.80.153.25 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [10:18:05] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:34] 10Operations, 10DBA: Replace `role::prometheus::mysqld_exporter` with `profile::prometheus::mysqld_exporter_instance` - https://phabricator.wikimedia.org/T255758 (10Kormat) [10:21:10] 10Operations, 10DBA: Replace `role::prometheus::mysqld_exporter` with `profile::prometheus::mysqld_exporter_instance` - https://phabricator.wikimedia.org/T255758 (10Kormat) p:05Triage→03Medium [10:30:05] PROBLEM - Memcached on idp-test2001 is CRITICAL: connect to address 208.80.153.25 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:30:06] (03Abandoned) 10Andrew-WMDE: TwoColConflict: Talk page small deployment InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605253 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [10:30:27] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:39] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:32:51] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:33:09] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:55] idp-test2001 is me, extending downtime [10:34:31] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:34:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:29] RECOVERY - Memcached on idp-test2001 is OK: TCP OK - 0.036 second response time on 208.80.153.25 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [10:41:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:44:18] 10Operations, 10DNS, 10Traffic, 10netbox: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (10Volans) Glad it solved for now, that's what I thought could be happening. I tried an authdns update but with no changes it exit earlier. I guess the way the cookbook calls the gdn... 
[10:50:13] (03CR) 10Marostegui: [C: 03+2] wmnet: Add es4 and es5 alias [dns] - 10https://gerrit.wikimedia.org/r/606403 (owner: 10Marostegui) [10:50:41] all deployers: scap sync --canary-wait-time option is available (https://phabricator.wikimedia.org/T217924) [10:56:39] (03PS1) 10Ammarpad: Initialize $wgImportSources as array. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606406 (https://phabricator.wikimedia.org/T255762) [10:56:59] (03PS1) 10DannyS712: CommonSettings-labs: Set `$wgImportSources` to an empty array, not false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606407 (https://phabricator.wikimedia.org/T255762) [10:57:13] (03PS2) 10DannyS712: CommonSettings-labs: Set `$wgImportSources` to an empty array, not false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606407 (https://phabricator.wikimedia.org/T255762) [10:57:21] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) Yesterday I inquired the folks at -discovery on IRC and David suggested to increase the number of concurrent writers to Elastic, an... [10:57:30] (03CR) 10Jcrespo: [C: 03+2] transferpy: Remove wmfmariadbpy package [software/transferpy] - 10https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [10:57:32] liw: excellent! [10:58:41] Is anyone available to deploy the fix for {T255762} to the beta cluster? [10:58:41] T255762: Special:Import fails to load on beta cluster - https://phabricator.wikimedia.org/T255762 [10:59:06] (03PS2) 10Ammarpad: Initialize $wgImportSources as array. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/606406 (https://phabricator.wikimedia.org/T255762) [11:00:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:01:38] Ping @jbond42 ^ - are you available for a beta-cluster-only deployment? [11:02:37] DannyS712: which one? [11:02:55] https://gerrit.wikimedia.org/r/#/c/606407 [11:03:35] looking [11:06:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:32] (03PS5) 10JMeybohm: Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) [11:12:05] DannyS712: i have not merged a wikimedia-config change before so just double checking on the process [11:14:05] (03CR) 10JMeybohm: Initial commit of debian directory (031 comment) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:18:23] (03PS5) 10Arturo Borrero Gonzalez: wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) [11:18:54] (03CR) 10Jcrespo: "Looks ok, see the below typos/rebase issues. We can add here or on a later patch the license: GPLv3, like cumin." 
(033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [11:20:32] (03CR) 10Nikerabbit: [C: 03+1] Remove TranslationNotifications user settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: 10DannyS712) [11:22:57] (03CR) 10Jcrespo: [C: 03+1] wmfmariadbpy: Remove transferpy package (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [11:23:04] (03PS2) 10Jcrespo: wmfmariadbpy: Remove transferpy package [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [11:24:55] (03CR) 10Jcrespo: [C: 03+2] wmfmariadbpy: Remove transferpy package [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/602595 (https://phabricator.wikimedia.org/T248256) (owner: 10Privacybatm) [11:25:17] (03PS1) 10Arturo Borrero Gonzalez: keepalived: simplify input arguments [puppet] - 10https://gerrit.wikimedia.org/r/606412 [11:25:38] (03CR) 10jerkins-bot: [V: 04-1] keepalived: simplify input arguments [puppet] - 10https://gerrit.wikimedia.org/r/606412 (owner: 10Arturo Borrero Gonzalez) [11:26:35] (03PS2) 10Arturo Borrero Gonzalez: keepalived: simplify input arguments [puppet] - 10https://gerrit.wikimedia.org/r/606412 [11:31:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] keepalived: simplify input arguments [puppet] - 10https://gerrit.wikimedia.org/r/606412 (owner: 10Arturo Borrero Gonzalez) [11:32:30] @jbond42 should I ping robh regarding the patch, or are you still looking at it? 
[11:33:19] DannyS712: not sure robh is around yet; I'm still looking, however I think everyone who can help is at lunch currently [11:33:57] since it's the beta cluster no sync should be needed (trying to find the logs where this was discussed a few days ago) - just a +2 should result in it going live a few minutes later [11:34:36] (03CR) 10Jcrespo: "The main blocker here, aside from the below comments regarding attribution, is the lack of documentation: We need a readme addition commen" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [11:36:58] DannyS712: I'm sure it's super easy, however I would prefer to wait for someone who knows the system, sorry for the delay [11:41:57] Channel logs from 2020-06-08: `[18:30:09] Beta updates its config automatically every 10 minutes, so you'll probably have to wait a little bit` [11:42:10] easier than production config [11:42:20] (03CR) 10Jforrester: "Yeah, let's rename this to lockeddown or something. I'll take care of this, unless you really want to. ;-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [11:42:24] but it makes sense if you want to wait; I'm just not sure I'll be able to stay around [11:43:41] DannyS712: ack thanks and sorry for the inconvenience [11:43:59] if I leave, can you ask whoever shows up to take a look? [11:44:52] DannyS712: I'll keep on this and make sure it gets merged when someone comes, is there anything else that needs doing?
[11:45:20] (03CR) 10Jcrespo: "After building I see a few issues:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [11:45:29] someone should visit https://en.wikipedia.beta.wmflabs.org/wiki/Special:Import and confirm that it doesn't fail [11:45:38] DannyS712: akosi.aris is taking a look now [11:45:43] thanks [11:46:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] CommonSettings-labs: Set `$wgImportSources` to an empty array, not false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606407 (https://phabricator.wikimedia.org/T255762) (owner: 10DannyS712) [11:46:39] (03CR) 10Marostegui: [C: 03+1] "Once merged and once puppet has run on install hosts, let's reimage db1077 to be triple sure that this works as intended with this last pa" [puppet] - 10https://gerrit.wikimedia.org/r/606401 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [11:47:05] DannyS712: it needs to be synced to prod as well, otherwise monitoring won't be happy [11:47:37] it does? 
https://www.mediawiki.org/wiki/Special:Import works fine for me, and it's a `-labs` file [11:48:51] I mean that if it's merged but not synced Icinga will start complaining [11:49:08] (03CR) 10Jbond: [C: 03+2] CommonSettings-labs: Set `$wgImportSources` to an empty array, not false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606407 (https://phabricator.wikimedia.org/T255762) (owner: 10DannyS712) [11:50:02] (03CR) 10Jforrester: "> Patch Set 1:" [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/606292 (https://phabricator.wikimedia.org/T255704) (owner: 10Urbanecm) [11:50:43] DannyS712: Majavah: thanks, I have merged and it is syncing now [11:50:51] thanks jbond42 [11:51:00] !log jbond@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: (no justification provided) (duration: 01m 00s) [11:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:39] 10Operations, 10Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (10ema) 05Open→03Resolved a:03ema >>! In T255368#6231244, @ema wrote: > what we need to do is (1) ensure the origins don't send Transfer-Encoding on 304 responses... [11:52:12] (03CR) 10Urbanecm: "> Patch Set 1:" [extensions/DiscussionTools] (wmf/1.35.0-wmf.36) - 10https://gerrit.wikimedia.org/r/606292 (https://phabricator.wikimedia.org/T255704) (owner: 10Urbanecm) [11:52:29] DannyS712: deployed and synced; however, it can still take 10 mins for labs to get the file [11:52:41] also "Beta updates its config automatically every 10 minutes" is false, they happen via Jenkins job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/305034/console [11:52:41] jouncebot: Now [11:52:42] For the next 19 hour(s) and 7 minute(s): Friday Holiday; no deploys Thursday. See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200618T0700) [11:52:48] jbond42: ^^^ :-P [11:53:21] does it not count as an https://wikitech.wikimedia.org/wiki/Deployments/Emergencies ? [11:53:22] James_F: ahh sorry :S [11:53:24] wasn't me [11:53:25] Also, you don't need to sync -labs files; nothing in production reads them, and Beta Cluster isn't affected by the sync state. [11:53:41] but doesn't it cause icinga go angry or something? [11:53:56] it just needs to be fetched at deploy1001 to not confuse humans :) [11:54:10] jbond42: It's fine, but next time please check with RelEng. :-) [11:54:17] confirmed to work, https://en.wikipedia.beta.wmflabs.org/w/index.php?title=User:DannyS712&oldid=430365 imported and https://en.wikipedia.beta.wmflabs.org/wiki/Special:Import doesn't fatal [11:54:25] James_F: will do thx [11:54:27] Majavah: Yes? The jenkins job runs automatically every 10 minutes. What is false? [11:54:56] (03CR) 10Jcrespo: "Some more description reviews." (033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [11:55:03] sorry for requesting it if it didn't qualify for emergency deployment [11:55:16] James_F: that job is triggered as a post-merge action [11:55:24] Majavah: No? [11:55:47] (03CR) 10DannyS712: "Duplicate of https://gerrit.wikimedia.org/r/606407, which has been merged and deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606406 (https://phabricator.wikimedia.org/T255762) (owner: 10Ammarpad) [11:55:49] Majavah: I mean, yes, but the general cron fetch also fetches it, I believe. [11:56:06] (03CR) 10Ema: [C: 03+2] purged: remove unused argument 'stats_dir' [puppet] - 10https://gerrit.wikimedia.org/r/606393 (owner: 10Ema) [12:01:27] (03CR) 10Kormat: [C: 03+2] install_server: Use reuse-db.cfg by default for db machines. 
[puppet] - 10https://gerrit.wikimedia.org/r/606401 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [12:04:51] !log reimaging db1077 for final test T251768 [12:05:03] (03CR) 10Jforrester: [C: 03+1] Update citoid to include change Ia5bc189 [deployment-charts] - 10https://gerrit.wikimedia.org/r/604655 (owner: 10Mvolz) [12:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:24] T251768: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 [12:07:03] (03Abandoned) 10Ammarpad: Initialize $wgImportSources as array. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606406 (https://phabricator.wikimedia.org/T255762) (owner: 10Ammarpad) [12:08:13] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:09:55] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) p:05Triage→03Medium [12:10:02] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10elukey) @RobH we ordered 6 nodes IIRC, 3 with more storage space and 3 called "lightweight", which batch is this? I have in mind the following naming: - an-test-worke... 
[12:10:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10elukey) a:05elukey→03Jclark-ctr [12:11:19] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) [12:11:22] 10Operations, 10ops-eqsin, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T253246 (10ayounsi) [12:16:25] (03PS1) 10Jbond: icinga: remove cachtpoint IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/606417 (https://phabricator.wikimedia.org/T95758) [12:20:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:20:50] (03PS1) 10Ayounsi: Replace cr1-eqsin with cr3-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/606419 (https://phabricator.wikimedia.org/T255766) [12:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:46] (03CR) 10Mvolz: [C: 03+2] Update citoid to include change Ia5bc189 [deployment-charts] - 10https://gerrit.wikimedia.org/r/604655 (owner: 10Mvolz) [12:24:11] (03Merged) 10jenkins-bot: Update citoid to include change Ia5bc189 [deployment-charts] - 10https://gerrit.wikimedia.org/r/604655 (owner: 10Mvolz) [12:24:29] (03CR) 10Ayounsi: [C: 03+1] "LGTM! Thanks." 
[homer/public] - 10https://gerrit.wikimedia.org/r/606206 (owner: 10CDanis) [12:30:17] (03PS1) 10Ssingh: add fake webserver password and API access key for dnsdist::wikidough [labs/private] - 10https://gerrit.wikimedia.org/r/606422 [12:31:50] (03CR) 10Ssingh: [C: 03+2] add fake webserver password and API access key for dnsdist::wikidough [labs/private] - 10https://gerrit.wikimedia.org/r/606422 (owner: 10Ssingh) [12:31:55] (03CR) 10Ssingh: [V: 03+2 C: 03+2] add fake webserver password and API access key for dnsdist::wikidough [labs/private] - 10https://gerrit.wikimedia.org/r/606422 (owner: 10Ssingh) [12:32:44] (03PS1) 10Ayounsi: Remove cr1-eqsin loopback & rename relevant links (cr1->cr3) [dns] - 10https://gerrit.wikimedia.org/r/606423 (https://phabricator.wikimedia.org/T255766) [12:38:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606417 (https://phabricator.wikimedia.org/T95758) (owner: 10Jbond) [12:39:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606417 (https://phabricator.wikimedia.org/T95758) (owner: 10Jbond) [12:42:31] (03CR) 10Jbond: [C: 03+2] icinga: remove cachtpoint IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/606417 (https://phabricator.wikimedia.org/T95758) (owner: 10Jbond) [12:43:29] (03PS1) 10Ayounsi: cr1-eqsin -> cr3-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/606425 (https://phabricator.wikimedia.org/T255766) [12:45:02] (03PS6) 10Arturo Borrero Gonzalez: wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) [12:47:00] 10Operations, 10ops-eqsin, 10netops, 10Patch-For-Review: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) [12:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es5 master as es1024 is fully repooled now', diff saved to https://phabricator.wikimedia.org/P11585 and previous config saved to 
/var/cache/conftool/dbconfig/20200618-124801-marostegui.json [12:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:14] 10Operations, 10ops-eqsin, 10netops, 10Patch-For-Review: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) [12:51:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: paws: haproxy: add keepalived support [puppet] - 10https://gerrit.wikimedia.org/r/605944 (https://phabricator.wikimedia.org/T195217) (owner: 10Arturo Borrero Gonzalez) [12:52:17] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P11586 and previous config saved to /var/cache/conftool/dbconfig/20200618-125216-marostegui.json [12:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:37] (03PS9) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:05:41] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:42] PROBLEM - Check systemd state on an-tool1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:45] this is me --^ [13:15:42] bad elukey bad! 
[13:15:44] ;P [13:20:06] vgutierrez: I know what I can say, you know me [13:20:11] :D [13:23:57] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10Jclark-ctr) Fixed alerts [13:24:08] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10Jclark-ctr) 05Open→03Resolved [13:25:07] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Jclark-ctr) @elukey Sounds good. i will be taking a vacation in august so july would be best [13:29:35] (03PS1) 10Paladox: gerrit: Redirect /r/(#/)projects/(.+),dashboards/(.+) to /r/p/$1/+/dashboard/$2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [13:31:42] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) Bump [13:31:54] (03PS2) 10Paladox: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$1/+/dashboard/$2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [13:34:08] (03PS1) 10Muehlenhoff: Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - 10https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) [13:34:47] (03PS3) 10Paladox: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [13:34:58] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:33] (03PS4) 10Paladox: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [13:36:13] (03PS1) 10Paladox: gerrit: drop old redirect [puppet] - 10https://gerrit.wikimedia.org/r/606434 
[13:36:56] (03PS2) 10Paladox: gerrit: drop old redirect [puppet] - 10https://gerrit.wikimedia.org/r/606434 [13:40:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:19] (03PS1) 10Jbond: profile::icinga: add vhost for external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/606437 (https://phabricator.wikimedia.org/T239323) [13:43:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:59] (03CR) 10Jbond: "will abandon in favour of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606437" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [13:44:03] (03Abandoned) 10Jbond: cas-icinga: Add an entry point for the external monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/598742 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [13:44:12] (03PS10) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:46:23] 10Operations, 10LDAP-Access-Requests: Add Abban to the ldap/nda group - https://phabricator.wikimedia.org/T255775 (10Tobi_WMDE_SW) [13:46:40] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:47:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [13:52:18] !log restart logstash2005 for applying an increased ganeti migration_downtime of 10k [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime 
[13:53:17] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:49] (03PS11) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:57:36] kubetcd2005 was/is the ganeti reboot, those use plain disk format [13:57:56] 10Puppet, 10User-jbond: puppet-merge: answering no to merging labs-private prevents puppet-merge from pushing to all puppet masters - https://phabricator.wikimedia.org/T251104 (10jbond) 05Open→03Resolved a:03jbond [13:59:25] (03PS12) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [14:02:04] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1075', diff saved to https://phabricator.wikimedia.org/P11589 and previous config saved to /var/cache/conftool/dbconfig/20200618-140203-marostegui.json [14:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:53] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P11590 and previous config saved to /var/cache/conftool/dbconfig/20200618-140352-marostegui.json [14:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:08:44] (03PS1) 10Kormat: mariadb: Add monitoring for lag spikes. 
[puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) [14:09:33] (03PS2) 10Kormat: mariadb: Add monitoring for lag spikes. [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) [14:10:40] (03CR) 10Kormat: "Adding Filippo to make sure i've set up the notifications correctly." [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [14:13:29] (03PS3) 10Kormat: mariadb: Add monitoring for lag spikes. [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) [14:14:10] !og failover ganeti master in codfw to ganeti2021 [14:14:32] !log failover ganeti master in codfw to ganeti2021 [14:15:09] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10AMooney) @tstarling anything left to do for this task? [14:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:50] PROBLEM - ganeti-mond running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:19:42] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1078', diff saved to https://phabricator.wikimedia.org/P11591 and previous config saved to /var/cache/conftool/dbconfig/20200618-141941-marostegui.json [14:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:53] monitoring blip, 2019 is no longer the master, Icinga will soon realise... 
[14:23:48] RECOVERY - ganeti-mond running on ganeti2019 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [14:25:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_logstash site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:26:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:30:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) @wiki_willy honestly at this point the best outcome is probably getting 'store credit' towards future purchases.... [14:31:46] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:32:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:32:52] (03PS13) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [14:33:34] the logstash7 kafka lag is me ^ [14:33:44] RECOVERY - Prometheus jobs
reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:33:58] 10Operations, 10ops-codfw, 10DC-Ops: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (10Papaul) a:05Papaul→03None [14:34:02] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [14:34:38] FFS [14:34:46] (03PS14) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [14:34:52] (03PS4) 10Kormat: mariadb: Add monitoring for lag spikes. [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) [14:35:02] another CR that's going to be an adult before getting merged [14:36:05] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) @ayounsi the configuation on mr1 is done. Can you take a look and see if i missed anything. The temp root password is the same as the mgmt password. [14:38:06] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:46:46] (03PS1) 10Jbond: taskgen: add new CI test to ensure defaults in prod are alos in cloud [puppet] - 10https://gerrit.wikimedia.org/r/606444 [14:46:48] (03PS1) 10Jbond: taskgen: test new CI check [puppet] - 10https://gerrit.wikimedia.org/r/606445 [14:46:50] (03PS1) 10Jbond: taskgen: fix CI issues [puppet] - 10https://gerrit.wikimedia.org/r/606446 [14:47:21] (03CR) 10jerkins-bot: [V: 04-1] taskgen: fix CI issues [puppet] - 10https://gerrit.wikimedia.org/r/606446 (owner: 10Jbond) [14:47:23] (03CR) 10jerkins-bot: [V: 04-1] taskgen: test new CI check [puppet] - 10https://gerrit.wikimedia.org/r/606445 (owner: 10Jbond) [14:48:15] (03PS2) 10Jbond: taskgen: add new CI test to ensure defaults in prod are alos in cloud [puppet] - 10https://gerrit.wikimedia.org/r/606444 [14:48:27] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [14:48:40] (03PS2) 10Jbond: taskgen: test new CI check [puppet] - 10https://gerrit.wikimedia.org/r/606445 [14:48:51] (03PS2) 10Jbond: taskgen: fix CI issues [puppet] - 10https://gerrit.wikimedia.org/r/606446 [14:49:03] (03CR) 10jerkins-bot: [V: 04-1] taskgen: test new CI check [puppet] - 10https://gerrit.wikimedia.org/r/606445 (owner: 10Jbond) [14:59:55] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23321/" [puppet] - 10https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [15:04:39] !log installing bind updates on jessie (client side tools/libs) [15:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:26] (03CR) 10Ema: ATS: Provide http to https 
redirection logic in lua (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [15:15:13] !log installing python-django security updates (packaged buster version) [15:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:27] (03CR) 10Jbond: [C: 03+1] "LGTM but do we also need the following?" [puppet] - 10https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [15:18:18] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: 10Jcrespo) [15:18:43] (03CR) 10Muehlenhoff: "Ah, yes.Let's also include that one, this was present in the test setup, only forgot to merge into the patch. I'll amend." [puppet] - 10https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [15:21:16] 10Operations, 10ops-eqsin, 10netops, 10Patch-For-Review: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10RobH) It isn't clear to me if this is for Jin (DreamICC) or for Equinix remote hands. If it is Jin, and I'll be supervising them, I prefer we never do work in eqsin on Monday, a... [15:21:36] (03PS2) 10Muehlenhoff: Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - 10https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) [15:21:43] 10Operations, 10Privacy Engineering, 10Research, 10Security-Team, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10Reedy) Minor issue, I going to https://wikiworkshop.org/2019 (and other older sites, rather than... 
[15:23:13] !log installing Ruby 2.1 security updates [15:23:47] (03CR) 10Jbond: [C: 03+1] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - 10https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) (owner: 10Muehlenhoff) [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:08] 10Operations, 10Privacy Engineering, 10Research, 10Security-Team, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10Reedy) ` curl -I -L https://wikiworkshop.org/2019 HTTP/2 301 date: Thu, 18 Jun 2020 15:17:14 GMT... [15:26:24] (03PS15) 10Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [15:26:28] 10Operations, 10Privacy Engineering, 10Research, 10Security-Team, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10Reedy) https://github.com/wikimedia/puppet/blob/58ac95353aca3f0925017407ddebc2d397cd9f2f/modules/... [15:26:59] (03CR) 10Vgutierrez: ATS: Provide http to https redirection logic in lua (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) (owner: 10Vgutierrez) [15:31:28] 10Operations, 10Privacy Engineering, 10Research, 10Security-Team, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10Vgutierrez) that's interesting.. why don't we have HSTS headers for wikiworkshop.org? [15:34:43] !log installing harfbuzz security updates [15:35:30] where is jouncebot? [15:35:54] !log created bot_passwords tables on officewiki and otrs_wikiwiki T254925 T246489 [15:36:11] And stashbot?
[15:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:26] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10akosiaris) [15:36:27] lazy bot [15:36:31] !now [15:36:38] jouncebot: now [15:36:38] For the next 15 hour(s) and 23 minute(s): Friday Holiday; no deploys Thursday. See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200618T0700) [15:36:41] jouncebot: next [15:36:41] In 15 hour(s) and 23 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200619T0700) [15:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:25] T254925: Bot passwords for officewiki - https://phabricator.wikimedia.org/T254925 [15:37:25] T246489: enable bot passwords otrs-wiki.wikimedia.org - https://phabricator.wikimedia.org/T246489 [15:38:07] (03PS1) 10Reedy: Enable BotPasswords on officewiki and otrs_wikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) [15:39:13] I've closed the tracking task for this week's train; if things start going bad, it can be reopened for further work [15:43:06] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10ema) >>! In T242767#6207395, @Ottomata wrote: > Hm, I'm pretty sure the connection is terminated even whe... 
[15:50:58] (03CR) 10Hashar: [C: 03+1] CI: add CI to check shell scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602693 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [15:53:57] (03PS1) 10Reedy: Remove comment about "Same regex as above in https_recv_redirect" [puppet] - 10https://gerrit.wikimedia.org/r/606457 [15:54:24] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) @wiki_willy in C8 we have two msw (fmsw-c8 and msw-c8) which msw are we going to replace since we have only 1 new mgmt switch per rack. [15:55:47] 10Puppet, 10Analytics, 10Cloud-VPS: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - https://phabricator.wikimedia.org/T255464 (10fdans) a:03elukey For this site, the puppet configuration needs to skip TLS deployment. [15:56:31] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Cloud-VPS: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - https://phabricator.wikimedia.org/T255464 (10fdans) a:05elukey→03fdans [15:59:50] (03PS1) 10CRusnov: netbox::scripts: Fix error in acme deploy [puppet] - 10https://gerrit.wikimedia.org/r/606458 [16:01:54] (03PS1) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) [16:05:03] (03CR) 10SBassett: [C: 03+1] Enable BotPasswords on officewiki and otrs_wikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) (owner: 10Reedy) [16:07:34] (03CR) 10Majavah: [C: 04-1] "task numbers comments missing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) (owner: 10Reedy) [16:08:32] (03CR) 10Reedy: ">task numbers comments missing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 
(https://phabricator.wikimedia.org/T254925) (owner: 10Reedy) [16:10:24] (03CR) 10Majavah: [C: 04-1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) (owner: 10Reedy) [16:10:52] (03CR) 10CRusnov: [C: 03+2] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/606458 (owner: 10CRusnov) [16:13:49] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (10jeblad) [16:15:03] !log reindexing French wiki in Elasticsearch [16:16:14] (03PS2) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) [16:16:40] 10Operations, 10ops-eqsin, 10netops, 10Patch-For-Review: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) Discussed over IRC. I don't have any strong preferences on the two questions above. It's a tradeoff between DC traffic, work quality and working hours for both remote han... 
[16:17:31] (03CR) 10Arturo Borrero Gonzalez: taskgen: add new CI test to ensure defaults in prod are alos in cloud (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606444 (owner: 10Jbond) [16:17:44] (03PS1) 10CRusnov: netbox::scripts: Fix typo in parameters [puppet] - 10https://gerrit.wikimedia.org/r/606462 [16:18:16] (03CR) 10CDanis: [C: 03+2] allow easy overriding of VRRP priority on all interfaces & update docs [homer/public] - 10https://gerrit.wikimedia.org/r/606206 (owner: 10CDanis) [16:18:40] (03Merged) 10jenkins-bot: allow easy overriding of VRRP priority on all interfaces & update docs [homer/public] - 10https://gerrit.wikimedia.org/r/606206 (owner: 10CDanis) [16:19:55] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) Results have been encouraging with a bigger backlog, peaking briefly at 15k logs/s overall submitted to ES and then decreasing. Als... [16:21:40] (03PS3) 10Jbond: taskgen: add new CI test to ensure defaults in prod are also in cloud [puppet] - 10https://gerrit.wikimedia.org/r/606444 [16:21:45] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606444 (owner: 10Jbond) [16:23:08] (03PS3) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) [16:24:13] (03CR) 10jerkins-bot: [V: 04-1] dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:26:22] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10wiki_willy) @Papaul - if we're short by one msw from upgrading everything, then I would say to not upgrade the most recent msw that you have at codfw. And... 
[16:28:19] (03PS4) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) [16:29:00] (03PS5) 10Dzahn: jenkins: replace system user/group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) [16:29:50] (03CR) 10Andrew Bogott: [C: 03+1] "Parsing this ruby code in my head is beyond me, but I am very much in favor of adding this test or something like it." [puppet] - 10https://gerrit.wikimedia.org/r/606444 (owner: 10Jbond) [16:30:22] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) >>! In T224591#6235038, @MoritzMuehlenh... [16:30:38] (03PS9) 10Privacybatm: Write documentation using Sphinx [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) [16:31:45] (03CR) 10Privacybatm: "> Patch Set 8:" (033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [16:34:50] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [16:34:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) Just parking a crazy idea I just had, mostly irrelevant to this ticket. > Large downloads are v... 
[16:35:50] (03PS1) 10Ladsgroup: meet: Change owner of account manager code to www-data [puppet] - 10https://gerrit.wikimedia.org/r/606464 [16:37:53] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) I just noticed the permissions of files... [16:38:39] mutante: hey, can you take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/606464 ? [16:39:04] Amir1: could you add me on Gerrit please? i can't currently do realtime [16:39:32] sure done [16:39:35] thanks [16:40:29] (03PS1) 10Andrew Bogott: codfw1dev: move keystone db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606466 (https://phabricator.wikimedia.org/T242455) [16:40:31] (03PS1) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [16:40:33] (03PS1) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [16:40:35] (03PS1) 10Andrew Bogott: codfw1dev: move designate db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606469 (https://phabricator.wikimedia.org/T242455) [16:40:39] (03PS1) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [16:41:06] (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:41:09] (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:41:24] (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: move 
designate db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606469 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:41:45] (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:43:22] (03PS2) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [16:43:26] (03PS2) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [16:43:28] (03PS2) 10Andrew Bogott: codfw1dev: move designate db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606469 (https://phabricator.wikimedia.org/T242455) [16:43:30] (03PS2) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [16:46:34] (03CR) 10CRusnov: [C: 03+2] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/606462 (owner: 10CRusnov) [16:49:06] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move keystone db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606466 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:49:20] !log Shut off non-dockerised deployment-prep instance of changeprop [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:59] !log reindex suspended until deployment of code [16:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:14] (03PS1) 10CRusnov: netbox: Fix netbox http configurations to use correct certificates [puppet] - 10https://gerrit.wikimedia.org/r/606473 [17:05:26] (03CR) 10CRusnov: "compiler confirms noop on netbox1001 as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/606473 (owner: 10CRusnov) [17:06:00] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1001/23325/netbox1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/606473 (owner: 10CRusnov) [17:09:00] 10Operations, 10ops-eqsin, 10netops, 10Patch-For-Review: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10RobH) I've created a google doc, since Jin doesn't use phabricator, outlining all the steps above: https://docs.google.com/document/d/1s2_ALpvDT9xTGihYE8BIXo41dSR9T11sZNUFs1mMrFM... [17:12:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet [17:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:29] !log wkandek@cumin1001 conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet [17:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:07] (03PS8) 10Dave Pifke: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [17:16:36] (03CR) 10jerkins-bot: [V: 04-1] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:16:37] !log wkandek@cumin1001 conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet [17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:50] (03CR) 10Dave Pifke: webperf: Remove XHGui dependency on MongoDB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:23:04] (03PS9) 10Dave Pifke: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [17:40:32] 10Operations, 10ops-eqsin: eqsin ganeti cable IDs - 
https://phabricator.wikimedia.org/T250369 (10RobH) 05Open→03Resolved All updated [17:41:29] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Jclark-ctr) @elukey 10g or 1g? [17:42:01] 10Operations, 10ops-eqsin: apply asset tags to s[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T244900 (10RobH) 05Stalled→03Resolved done and netbox updated [17:42:03] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) [17:42:07] 10Operations, 10ops-eqsin: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (10RobH) [17:46:06] (03PS3) 10Andrew Bogott: codfw1dev: move designate db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606469 (https://phabricator.wikimedia.org/T242455) [17:46:08] (03PS3) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [17:46:10] (03PS3) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [17:46:12] (03PS3) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [17:46:48] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move designate db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606469 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:47:07] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Jclark-ctr) @elukey 10g or 1g? 
[18:00:43] (03PS4) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:00:45] (03PS4) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:00:47] (03PS4) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [18:00:49] (03PS1) 10Andrew Bogott: designate database: expand grants to include cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/606483 (https://phabricator.wikimedia.org/T242455) [18:03:18] (03CR) 10Andrew Bogott: [C: 03+2] designate database: expand grants to include cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/606483 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [18:11:21] (03PS5) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [18:11:23] (03PS5) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:11:25] (03PS5) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:16:53] (03PS6) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [18:16:55] (03PS6) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:16:57] (03PS6) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:17:28] 10Operations, 10serviceops: move all 
86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2339.codfw.wmnet ` The log can be found in `/var/log/... [18:18:03] (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [18:18:33] (03PS2) 10Mstyles: sdoc gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [18:19:17] PROBLEM - Host 208.80.153.83 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:10] (03PS7) 10Andrew Bogott: codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) [18:20:12] (03PS7) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:20:14] (03PS7) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:25:32] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move puppet enc storage to galera [puppet] - 10https://gerrit.wikimedia.org/r/606470 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [18:32:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:32:54] (03PS8) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:32:56] (03PS8) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:32:58] (03PS1) 10Andrew Bogott: Try to work around puppet's weird list handling [puppet] - 10https://gerrit.wikimedia.org/r/606489 [18:33:15] (03PS3) 10Mstyles: sdoc 
gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [18:34:37] (03CR) 10Andrew Bogott: [C: 03+2] Try to work around puppet's weird list handling [puppet] - 10https://gerrit.wikimedia.org/r/606489 (owner: 10Andrew Bogott) [18:35:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:37] (03PS4) 10Mstyles: sdoc gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [18:38:13] (03PS9) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:38:15] (03PS9) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:38:17] (03PS1) 10Andrew Bogott: cloud-vps puppetmaster db grants: further attempt to get fix the template [puppet] - 10https://gerrit.wikimedia.org/r/606490 [18:39:40] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppetmaster db grants: further attempt to get fix the template [puppet] - 10https://gerrit.wikimedia.org/r/606490 (owner: 10Andrew Bogott) [18:45:10] (03PS10) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:45:12] (03PS10) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:45:14] (03PS1) 10Andrew Bogott: cloud-vps puppetmasters: create /etc/labspuppet dir [puppet] - 10https://gerrit.wikimedia.org/r/606491 [18:46:37] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps puppetmasters: create /etc/labspuppet dir [puppet] - 10https://gerrit.wikimedia.org/r/606491 (owner: 10Andrew 
Bogott) [18:47:32] (03PS2) 10Andrew Bogott: cloud-vps puppetmasters: create /etc/labspuppet dir [puppet] - 10https://gerrit.wikimedia.org/r/606491 [18:47:34] (03PS11) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [18:47:36] (03PS11) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [18:47:48] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23327/ pcc error; the secret string lookup fails." [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:49:04] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-eqsin, and 2 others: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) [18:49:06] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppetmasters: create /etc/labspuppet dir [puppet] - 10https://gerrit.wikimedia.org/r/606491 (owner: 10Andrew Bogott) [18:51:25] 10Operations, 10ops-eqsin: update power ports for ps[12]-603-eqiad - https://phabricator.wikimedia.org/T255812 (10RobH) [18:51:30] 10Operations, 10serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2339.codfw.wmnet'] ` and were **ALL** successful. 
[18:53:07] !log wkandek@cumin1001 conftool action : set/pooled=yes; selector: name=mw2339.codfw.wmnet [18:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:53] 10Operations, 10ops-eqsin: update power ports for ps[12]-603-eqiad - https://phabricator.wikimedia.org/T255812 (10RobH) also need to update the following cables Cr3-eqsin Ps1 ID 1145 port 5, ps2 ID 1146 port 5 Cr3 mgmt is black utp cable ID 1147 2M black UTP from cr3 mgmt to msw1 port 8 [19:09:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:11:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:13:23] (03PS12) 10Andrew Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [19:13:25] (03PS12) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [19:13:27] (03PS1) 10Andrew Bogott: galera: provide access to puppetmaster (for the enc) [puppet] - 10https://gerrit.wikimedia.org/r/606496 [19:15:22] (03CR) 10Andrew Bogott: [C: 03+2] galera: provide access to puppetmaster (for the enc) [puppet] - 10https://gerrit.wikimedia.org/r/606496 (owner: 10Andrew Bogott) [19:17:10] (03PS5) 10Mstyles: sdoc gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [19:17:43] (03CR) 10jerkins-bot: [V: 04-1] sdoc gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [19:20:00] (03PS13) 10Andrew 
Bogott: codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) [19:20:02] (03PS13) 10Andrew Bogott: codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) [19:20:04] (03PS1) 10Andrew Bogott: codfw1dev: update recursor names [puppet] - 10https://gerrit.wikimedia.org/r/606498 [19:21:04] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: update recursor names [puppet] - 10https://gerrit.wikimedia.org/r/606498 (owner: 10Andrew Bogott) [19:22:19] (03PS6) 10Mstyles: sdoc gui idea [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [19:24:22] (03PS1) 10Jeena Huneidi: blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/606501 (https://phabricator.wikimedia.org/T248927) [19:31:23] (03CR) 10Dzahn: "Do you know which files in that directory the webserver actually needs to write to? Or is it just about reading? A compromised webserver w" [puppet] - 10https://gerrit.wikimedia.org/r/606464 (owner: 10Ladsgroup) [19:32:31] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move neutron db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606468 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:35:04] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:37:35] (03CR) 10Dzahn: "Thanks for this! I can add the secrets to puppet-private (maybe you can just put them somewhere on tungsten and tell me where). 
I can als" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [19:39:42] (03PS1) 10Ottomata: EventLogging - use EventGate on group1 wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606515 (https://phabricator.wikimedia.org/T249261) [19:41:06] (03CR) 10Dzahn: "I don't think manual cleanup is needed. ferm rules and rsyncd snippets get removed automatically. The only thing to manually cleanup would" [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:41:40] (03PS2) 10Dzahn: ci:master: prepare for contint switchover [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:41:44] (03CR) 10Ottomata: [C: 03+2] EventLogging - use EventGate on group1 wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606515 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [19:42:07] (03CR) 10Dzahn: [C: 03+2] ci:master: prepare for contint switchover [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:44:02] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on group1 wikis - T249261 (duration: 00m 57s) [19:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:07] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [19:44:25] (03CR) 10Hashar: "> I don't think manual cleanup is needed. ferm rules and rsyncd snippets get removed automatically. 
The only thing to manually cleanup wou" [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:44:39] (03CR) 10Dzahn: "Notice: /Stage[main]/Ferm/File[/etc/ferm/conf.d/10_ci-migration-rsync]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:44:44] mutante: <3 [19:45:26] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:45:46] hashar: yea, so the rsyncd config snippets stay.. but i think that's ok, as long as the firewall hole is closed.. and it is [19:46:00] prevents accidentally syncing in the wrong direction [19:46:16] then anyone with access on contint1001 can just rsync anything from contint2001 ? [19:46:23] and in addition the rsyncd also has "hosts allow" [19:47:02] not anything, just the paths we defined that should be synced between them [19:47:33] well.. and no.. because ferm hole is closed [19:47:46] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move nova db to galera [puppet] - 10https://gerrit.wikimedia.org/r/606467 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:48:45] but since that change switches the src and dst. I guess it is now open on contint1001 [19:48:57] (03CR) 10Muehlenhoff: "I don't think that's true, the rsyncd config snippets do stay around, some groundwork was done last year, but not completed yet." [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [19:49:15] yea, 1001 will accept connections from 2001.. that is the intention of the change though [19:49:45] right.
but not the other way around [20:10:14] (03PS1) 10Ottomata: EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606517 (https://phabricator.wikimedia.org/T249261) [20:10:29] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606473 (owner: 10CRusnov) [20:11:25] (03CR) 10Dzahn: "deleted the rsyncd fragments and restarted rsync on contint2001 anyways" [puppet] - 10https://gerrit.wikimedia.org/r/606394 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [20:15:55] (03CR) 10Ottomata: [C: 03+2] EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606517 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [20:17:19] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on all wikis - T249261 (duration: 00m 57s) [20:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:24] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [20:34:20] (03CR) 10Hashar: "Sounds like having fixed UID is a good idea. 
Note the jenkins Debian package does create user and group, so maybe we need a before => Pac" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn) [20:36:28] (PS1) CDanis: add alerting for Mediawiki PHP-FPM worker pool saturation [puppet] - https://gerrit.wikimedia.org/r/606519 (https://phabricator.wikimedia.org/T252605) [20:42:17] PROBLEM - Host 208.80.153.83 is DOWN: PING CRITICAL - Packet loss = 100% [20:52:31] PROBLEM - Check the last execution of check-homer-diff on cumin2001 is CRITICAL: CRITICAL: Status of the systemd unit check-homer-diff https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:52:37] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:26] XioNoX: ^^^ [20:53:26] (CR) Dave Pifke: "Adding the DB is being tracked in T254795." [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke) [20:53:39] or you ignore icinga-wm too? :-P [20:53:53] what's icinga-wm? [20:54:00] :) [20:54:25] volans: what's the error about ? 
[20:55:02] the homer diff email [20:55:07] it runs as a systemd timer [20:55:19] weird [20:55:20] Jun 18 20:45:46 cumin2001 check-homer-diff[32486]: /usr/local/sbin/check-homer-diff: line 19: mail: command not found [20:55:23] Jun 18 20:45:46 cumin2001 check-homer-diff[32486]: /usr/local/sbin/check-homer-diff: line 19: mail: command not found [20:55:50] I think in Buster `mail` got replaced by `slack` [20:55:55] ahahaha [21:00:18] I should have said `Debian 10` to trigger even more people [21:02:08] (PS1) Volans: homer: add dependency on bsd-mailx [puppet] - https://gerrit.wikimedia.org/r/606526 [21:02:18] XioNoX: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606526 [21:03:13] volans: did you try to install it manually? [21:03:35] overall it lgtm, but I can't really verify it [21:03:59] it's what we have on 1001 [21:04:14] /usr/bin/mail goes to alternatives that goes to bsd-mailx [21:04:28] ok! [21:04:39] (CR) Ayounsi: [C: +1] homer: add dependency on bsd-mailx [puppet] - https://gerrit.wikimedia.org/r/606526 (owner: Volans) [21:11:09] (CR) Volans: [C: +2] homer: add dependency on bsd-mailx [puppet] - https://gerrit.wikimedia.org/r/606526 (owner: Volans) [21:14:31] !log start check-homer-diff.service on cumin2001 after merging the fix r/606526 [21:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:03] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:52] thx! 
[21:19:01] I'll check with m.o.ri.tz tomorrow if another approach is preferable [21:23:33] RECOVERY - Check the last execution of check-homer-diff on cumin2001 is OK: OK: Status of the systemd unit check-homer-diff https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:24:56] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (Josve05a) [21:37:23] (CR) RLazarus: [C: +1] add alerting for Mediawiki PHP-FPM worker pool saturation [puppet] - https://gerrit.wikimedia.org/r/606519 (https://phabricator.wikimedia.org/T252605) (owner: CDanis) [21:38:31] (CR) VolkerE: "May I ask, why are we leaving metadata in?" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles) [21:43:55] (CR) VolkerE: "Goes alongside optimizations by TinyPNG. It is even saving a bit more. And my guess this is from removing metadata. Otherwise looks good t" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles) [21:45:19] (PS1) QChris: gerrit: Add option to mark gerrit servers as upgraded [puppet] - https://gerrit.wikimedia.org/r/606530 [21:45:21] (PS1) QChris: gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 [21:45:23] (PS1) QChris: gerrit: Add dedicated home dir for new Gerrit version [puppet] - https://gerrit.wikimedia.org/r/606532 [21:45:25] (PS1) QChris: gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606533 [21:46:10] (CR) jerkins-bot: [V: -1] gerrit: Add dedicated home dir for new Gerrit version [puppet] - https://gerrit.wikimedia.org/r/606532 (owner: QChris) [21:46:12] (CR) QChris: "This whole topic is completely untested." 
[puppet] - https://gerrit.wikimedia.org/r/606530 (owner: QChris) [21:52:17] (CR) CDanis: [C: +2] "thanks! PCC lgtm https://puppet-compiler.wmflabs.org/compiler1001/23328/icinga1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/606519 (https://phabricator.wikimedia.org/T252605) (owner: CDanis) [21:53:29] (PS2) QChris: gerrit: Add dedicated home dir for new Gerrit version [puppet] - https://gerrit.wikimedia.org/r/606532 [21:53:31] (PS2) QChris: gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606533 [21:58:48] (PS1) Andrew Bogott: Galera/mysql: increase max_connections to 500 [puppet] - https://gerrit.wikimedia.org/r/606534 (https://phabricator.wikimedia.org/T242455) [21:58:50] (PS1) Andrew Bogott: haproxy: add some settings for tcp backends [puppet] - https://gerrit.wikimedia.org/r/606535 (https://phabricator.wikimedia.org/T242455) [21:59:30] (CR) Andrew Bogott: [C: +2] Galera/mysql: increase max_connections to 500 [puppet] - https://gerrit.wikimedia.org/r/606534 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott) [22:00:23] (CR) Andrew Bogott: [C: +2] haproxy: add some settings for tcp backends [puppet] - https://gerrit.wikimedia.org/r/606535 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott) [22:05:55] Operations, observability, serviceops, Patch-For-Review: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (CDanis) Open→Resolved We now have an alert and a graph based on scraping the status string that php-fpm provides to systemd, which is reliab... 
[22:09:18] (PS1) QChris: gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536 [22:18:57] (CR) CRusnov: [C: +2] netbox: Fix netbox http configurations to use correct certificates [puppet] - https://gerrit.wikimedia.org/r/606473 (owner: CRusnov) [22:26:23] (PS1) Ottomata: Revert "EventLogging - use EventGate on all wikis for SearchSatisfaction" [mediawiki-config] - https://gerrit.wikimedia.org/r/606539 [22:27:03] (CR) Ottomata: [V: +2 C: +2] "Events are being sent to varnish beacon in new format and failing validation!" [mediawiki-config] - https://gerrit.wikimedia.org/r/606539 (owner: Ottomata) [22:27:27] (PS1) Ottomata: Revert "EventLogging - use EventGate on group1 wikis for SearchSatisfaction" [mediawiki-config] - https://gerrit.wikimedia.org/r/606540 [22:27:45] (PS2) Ottomata: Revert "EventLogging - use EventGate on group1 wikis for SearchSatisfaction" [mediawiki-config] - https://gerrit.wikimedia.org/r/606540 [22:28:10] (CR) Ottomata: [V: +2 C: +2] "Events are going to varnish beacon in new format and failing validation" [mediawiki-config] - https://gerrit.wikimedia.org/r/606540 (owner: Ottomata) [22:29:11] (Merged) jenkins-bot: Revert "EventLogging - use EventGate on group1 wikis for SearchSatisfaction" [mediawiki-config] - https://gerrit.wikimedia.org/r/606540 (owner: Ottomata) [22:30:48] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging to EventGate: - SearchSatisfaction on all wikis - T249261 (duration: 00m 56s) [22:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:52] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [22:34:15] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - 
https://phabricator.wikimedia.org/T255733 (Dzahn) To clarify: Do you want the mail to still be in OTRS but additionally also forward to the external address or do you want i... [22:35:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:02:40] (PS1) Volans: deploy-check: fix detection of need reload [dns] - https://gerrit.wikimedia.org/r/606542 (https://phabricator.wikimedia.org/T255748) [23:02:57] (PS1) Bstorm: cloud nfs: only run nfs-exportd on the current active node [puppet] - https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) [23:10:32] Operations, SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (jwang) [23:10:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 55 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:16:23] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:30:45] (PS7) Mstyles: sdoc gui custom config [puppet] - https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) [23:37:32] (CR) Mstyles: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: Mstyles)