[00:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:33] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE
[00:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:24] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE
[00:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:33] Anyone up for a late backport?
[00:05:07] Might roll out https://gerrit.wikimedia.org/r/670125 otherwise
[00:07:23] SRE, serviceops, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Krinkle)
[00:09:04] (CR) Ladsgroup: "I deploy this tomorrow or the day after. Just quick question, what is with musical-notation? AFAIK, score is disabled for now. We can enab" [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[00:09:12] (PS1) Bstorm: wikireplicas: add new columns for abuse_filter_log to wikireplicas [puppet] - https://gerrit.wikimedia.org/r/670325 (https://phabricator.wikimedia.org/T234615)
[00:09:38] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet'] ` and were **ALL** successful.
[00:16:33] SRE, ops-eqiad, DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (RobH)
[00:17:06] SRE, ops-eqiad, DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (RobH) Open→Resolved
[00:20:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:21:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:27:44] SRE, Analytics, observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (crusnov) p:Triage→Medium
[00:28:15] SRE, Patch-For-Review: Role with quote in description causes bash syntax error - https://phabricator.wikimedia.org/T276868 (crusnov) p:Triage→Medium
[00:33:50] (CR) Krinkle: [C: +2] Fix layout shift class name parsing for SVGElement [extensions/NavigationTiming] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670125 (https://phabricator.wikimedia.org/T276826) (owner: Krinkle)
[00:54:10] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:07] (Merged) jenkins-bot: Fix layout shift class name parsing for SVGElement [extensions/NavigationTiming] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670125 (https://phabricator.wikimedia.org/T276826) (owner: Krinkle)
[01:02:44] (CR) CRusnov: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/670243 (owner: Volans)
[01:05:05] (PS1) Papaul: DHCP: Add MAC address for kafka-loggin200[123] [puppet] - https://gerrit.wikimedia.org/r/670327 (https://phabricator.wikimedia.org/T274905)
[01:06:22] (CR) CRusnov: [C: +1] "Looks good!" [software/spicerack] - https://gerrit.wikimedia.org/r/670233 (owner: Volans)
[01:06:43] (CR) CRusnov: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/670234 (owner: Volans)
[01:08:25] !log krinkle@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/NavigationTiming/modules/ext.navigationTiming.js: T276826 Ibd9ddf14d64 (duration: 01m 14s)
[01:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:33] T276826: Fundraising banner with inline SVG triggers `Uncaught TypeError: node.className.replace is not a function` for layout shift due to SVGAnimatedString className attribute. - https://phabricator.wikimedia.org/T276826
[01:09:48] RECOVERY - Memory correctable errors -EDAC- on thumbor2001 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops
[01:23:48] (CR) Papaul: [C: +2] DHCP: Add MAC address for kafka-loggin200[123] [puppet] - https://gerrit.wikimedia.org/r/670327 (https://phabricator.wikimedia.org/T274905) (owner: Papaul)
[01:25:14] (PS1) Dzahn: add gitlab.wikimedia.org service alias, point to gitlab1001 [dns] - https://gerrit.wikimedia.org/r/670330 (https://phabricator.wikimedia.org/T276170)
[01:27:54] (PS3) Phamhi: wikireplica: depool clouddb1018 [puppet] - https://gerrit.wikimedia.org/r/670190
[01:28:11] (PS2) Dzahn: add gitlab.wikimedia.org service alias, point to gitlab1001 [dns] - https://gerrit.wikimedia.org/r/670330 (https://phabricator.wikimedia.org/T276170)
[01:30:45] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) Ok, thanks. I will prepare patches but wait with merging them for now.
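The NavigationTiming fix synced at 01:08 (T276826) deals with a DOM type mismatch: on an HTML element `className` is a plain string, but on an SVG element it is an `SVGAnimatedString` whose string value lives in `baseVal`, so calling `.replace()` on it throws. A minimal TypeScript sketch of one way to normalize the value follows; it is not the code from change 670125, and the `ElementLike` / `AnimatedStringLike` types are hypothetical stand-ins so that it runs outside a browser.

```typescript
// Hypothetical stand-ins for the DOM types. On an SVGElement, `className`
// is an SVGAnimatedString whose string value is held in `baseVal`; on an
// HTMLElement it is a plain string.
interface AnimatedStringLike {
  baseVal: string;
}

interface ElementLike {
  className: string | AnimatedStringLike;
}

// Return the element's classes as one space-separated string, whatever the
// element type. Calling .replace() directly on an SVGAnimatedString is what
// produced "node.className.replace is not a function".
function classNameOf(node: ElementLike): string {
  const raw =
    typeof node.className === "string" ? node.className : node.className.baseVal;
  // `raw` is now guaranteed to be a string, so string methods are safe.
  return raw.replace(/\s+/g, " ").trim();
}

console.log(classNameOf({ className: "banner  cta" }));            // banner cta
console.log(classNameOf({ className: { baseVal: " svg-icon " } })); // svg-icon
```

In a real browser, `element.getAttribute("class")` is an alternative that returns a string (or `null`) for both HTML and SVG elements.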
[01:33:39] (PS1) Dzahn: gitlab: open port 80 and 443 on gitlab1001 to the world [puppet] - https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144)
[01:34:49] (CR) jerkins-bot: [V: -1] gitlab: open port 80 and 443 on gitlab1001 to the world [puppet] - https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: Dzahn)
[01:35:31] (CR) Phamhi: [C: +2] wikireplica: depool clouddb1018 [puppet] - https://gerrit.wikimedia.org/r/670190 (owner: Phamhi)
[01:37:24] (PS2) Dzahn: gitlab: open port 80 and 443 on gitlab1001 to the world [puppet] - https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144)
[01:40:09] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) >>! In T276144#6890920, @Sergey.Trofimovsky.SF wrote: > There's not much risk in kee...
[01:40:33] (PS1) Papaul: Add kafka-logging200[123] to site.pp [puppet] - https://gerrit.wikimedia.org/r/670332 (https://phabricator.wikimedia.org/T274905)
[01:42:06] (CR) Papaul: [C: +2] Add kafka-logging200[123] to site.pp [puppet] - https://gerrit.wikimedia.org/r/670332 (https://phabricator.wikimedia.org/T274905) (owner: Papaul)
[01:49:38] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:53:50] (PS3) Dzahn: add gitlab.wikimedia.org service alias, point to gitlab1001 [dns] - https://gerrit.wikimedia.org/r/670330 (https://phabricator.wikimedia.org/T276170)
[01:54:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:59:54] (CR) Dzahn: "note: currently this just affects gitlab1001 because there is no other VM with that role, but if there was gitlab1002 it would automatical" [puppet] - https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: Dzahn)
[02:14:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:58] (PS2) Whym: Fix obsolete comments on wgCheckUserLogLogins [mediawiki-config] - https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802)
[03:39:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ldap site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:42:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:23:46] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudvirt1038.eqiad.wmnet, pki2001.codfw.wmnet, ms-be1022.eqiad.wmnet, wdqs1010.eqiad.wmnet, maps1009.eqiad.wmnet, ms-be1019.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:52:47] !log T266470 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation
[04:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:57] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[04:53:38] !log T266470 ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'
[04:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:45] !log T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'`
[04:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:08] !log T266470 Certificate revoked: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean wdqs.discovery.wmnet`
[04:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:08] !log T266470 In the `/srv/private` repo, `/srv/private/modules/secret/secrets/certificates/certificate.manifests.d/wdqs.certs.yaml` has been edited to add the relevant `alt_names`
[04:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:18] !log T266470 `sudo rm -fv certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 certificates/wdqs.discovery.wmnet/truststore.jks` (full paths not provided to fit the IRC line)
[04:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:17] !log T266470 The above two actions mean that we're ready to generate the new certificate files. Proceeding: `sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d` on `ryankemper@puppetmaster1001:/srv/private`
[04:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:23] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[05:05:32] (PS1) Ryan Kemper: wdqs: update cert for query-preview.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/670337
[05:06:03] (PS2) Ryan Kemper: wdqs: update cert for query-preview.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/670337 (https://phabricator.wikimedia.org/T266470)
[05:06:38] (CR) Ryan Kemper: [C: +2] wdqs: update cert for query-preview.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/670337 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[05:06:47] !log T266470 New `wdqs.discovery.wmnet.crt` added to public `operations/puppet` repo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670337/
[05:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:54] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[05:13:29] !log T266470 [`/srv/private`] `chown gitpuppet:gitpuppet` on all modified files (were owned by root, probably because I sudo'd - may be that a git commit hook would have caught that but explicitly chowning just to be safe)
[05:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:36] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[05:15:25] !log T266470 [`/srv/private`] All changes committed to private git repo, commit SHA `ec1d6cfae8c72e4f807b343cdb9f25c27817d98d`
[05:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:15] !log Enabling puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent`
[05:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:53] grr forgot the ticket on that last log message, let's try again:
[05:20:55] !log T266470 Enabled puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent`
[05:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:02] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[05:21:52] SRE, ops-codfw, DBA: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (Marostegui) Thank you Papaul, that made the server come back to life. Which is good news, as we are now fully aware that T216240 is an issue.
[05:22:06] SRE, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[05:24:33] !log T266470 Test queries passing on `wdqs1004`, and `https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&from=now-1h&to=now` looks as expected. Proceeding to rest of fleet
[05:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:29] !log T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"'` and `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo run-puppet-agent'`
[05:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:36] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[05:27:31] (PS1) Marostegui: db2073: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670338
[05:27:34] !log T266470 Rollout of updated certificate complete. We're now ready to implement envoy for `wdqs-test` which will allow `wdqs1009` to be reachable via port 443 and thereby allow us to go live with `query-preview.wikidata.org` when the time comes
[05:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:15] (CR) Marostegui: [C: +2] db2073: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670338 (owner: Marostegui)
[05:32:22] (PS1) Ryan Kemper: wdqs: impl. envoy for wdqs-test [puppet] - https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470)
[05:39:27] (CR) Ryan Kemper: wdqs: impl. envoy for wdqs-test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[05:39:57] (CR) Marostegui: [C: +1] dbbackups: Limit concurrency of es backups on codfw, too [puppet] - https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[05:48:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:50:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:56:15] (PS1) Phamhi: Revert "wikireplica: depool clouddb1018" [puppet] - https://gerrit.wikimedia.org/r/670346
[05:56:50] (CR) Phamhi: [C: +2] Revert "wikireplica: depool clouddb1018" [puppet] - https://gerrit.wikimedia.org/r/670346 (owner: Phamhi)
[05:59:04] (CR) Phamhi: [C: +2] wikireplica: depool clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/670209 (owner: Phamhi)
[06:09:52] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 15 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[06:13:28] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 17 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:14:15] (PS1) Phamhi: Revert "wikireplica: depool clouddb1019" [puppet] - https://gerrit.wikimedia.org/r/670347
[06:15:43] (CR) Phamhi: [C: +2] Revert "wikireplica: depool clouddb1019" [puppet] - https://gerrit.wikimedia.org/r/670347 (owner: Phamhi)
[06:16:08] (PS2) Legoktm: [WIP] Add shellbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/667047
[06:17:06] !log reimage an-worker1111 to buster
[06:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:03] (CR) Phamhi: [C: +2] wikireplica: depool clouddb1020 [puppet] - https://gerrit.wikimedia.org/r/670221 (owner: Phamhi)
[06:23:36] (CR) Ayounsi: [C: +1] doc: move ClusterShell URL to HTTPS (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/670243 (owner: Volans)
[06:24:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:13] (CR) Ayounsi: [V: +1] netbox: fix object type returned for status [software/spicerack] - https://gerrit.wikimedia.org/r/670234 (owner: Volans)
[06:27:19] (CR) Ayounsi: [C: +1] netbox: fix object type returned for status [software/spicerack] - https://gerrit.wikimedia.org/r/670234 (owner: Volans)
[06:29:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:56] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 15 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[06:35:10] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 17 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:36:08] (PS1) Phamhi: Revert "wikireplica: depool clouddb1020" [puppet] - https://gerrit.wikimedia.org/r/670348
[06:37:20] (CR) Phamhi: [C: +2] Revert "wikireplica: depool clouddb1020" [puppet] - https://gerrit.wikimedia.org/r/670348 (owner: Phamhi)
[06:38:58] ops-eqiad, Analytics: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (elukey)
[06:43:05] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1111.eqiad.wmnet with reason: REIMAGE
[06:43:07] ops-eqiad, Analytics: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (elukey) @razzi the error in icinga is `CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteTh...
[06:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1111.eqiad.wmnet with reason: REIMAGE
[06:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:36] (CR) Ayounsi: [C: +1] netbox: add NetboxServer class (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[06:59:12] (PS1) Marostegui: dbproxy1019: Depool clouddb1016 (s5 and s8) [puppet] - https://gerrit.wikimedia.org/r/670342 (https://phabricator.wikimedia.org/T269211)
[06:59:50] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1016 (s5 and s8) [puppet] - https://gerrit.wikimedia.org/r/670342 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[07:00:09] !log Depool clouddb1016
[07:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:53] !log remove the oldest kernel on ganeti nodes to free space for /boot
[07:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14713 and previous config saved to /var/cache/conftool/dbconfig/20210310-070312-marostegui.json
[07:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:54] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
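The wdqs certificate rotation logged between 04:52 and 05:27 (T266470) follows a canary pattern: puppet is disabled on all of `A:wdqs-all`, the certificate is revoked, regenerated with `cergen` and committed, puppet is re-enabled on one host (wdqs1004), and only after test queries and the Grafana dashboard look right is it re-enabled on the rest of the fleet. A minimal TypeScript sketch of that control flow, assuming hypothetical `apply` and `verify` callbacks in place of the real cumin, `enable-puppet` and `run-puppet-agent` commands; host names other than wdqs1004 are illustrative only.

```typescript
// Hypothetical callbacks standing in for the real operations
// (e.g. enabling puppet and running the agent on one host).
type HostAction = (host: string) => void;
type HostCheck = (host: string) => boolean;

// Apply a change to a single canary host, verify it, and only then apply it
// to the remaining hosts. Returns the hosts in the order they were changed;
// throws without touching the rest of the fleet if the canary check fails.
function canaryRollout(
  hosts: string[],
  canary: string,
  apply: HostAction,
  verify: HostCheck,
): string[] {
  if (!hosts.includes(canary)) {
    throw new Error(`canary ${canary} is not in the host list`);
  }
  apply(canary);
  if (!verify(canary)) {
    throw new Error(`verification failed on ${canary}; rest of fleet untouched`);
  }
  const changed = [canary];
  for (const host of hosts) {
    if (host === canary) continue; // already handled above
    apply(host);
    changed.push(host);
  }
  return changed;
}

const order = canaryRollout(
  ["wdqs1003", "wdqs1004", "wdqs1005"],
  "wdqs1004",
  (host) => console.log(`enable puppet on ${host}`),
  () => true,
);
console.log(order.join(", ")); // wdqs1004, wdqs1003, wdqs1005
```

Unlike this sketch, the logged fleet-wide cumin run also targeted wdqs1004 again; a second puppet run on an already-converged host is normally expected to change nothing, so skipping the canary here is only a simplification.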
[07:05:08] ^ expected
[07:05:29] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy
[07:06:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2145', diff saved to https://phabricator.wikimedia.org/P14714 and previous config saved to /var/cache/conftool/dbconfig/20210310-070642-marostegui.json
[07:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:26] (PS1) Marostegui: db2145: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670343
[07:07:59] !log sudo apt-get remove linux-image-4.9.0-9-amd64 on sodium to free space for /boot
[07:08:01] (CR) Marostegui: [C: +2] db2145: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670343 (owner: Marostegui)
[07:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:58] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:55] ops-eqiad, DBA, DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (Marostegui)
[07:17:06] ops-eqiad, DBA, DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (Marostegui) p:Triage→High
[07:17:41] ops-eqiad, DBA, DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (Marostegui)
[07:17:44] SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[07:17:47] SRE, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[07:23:42] (PS1) Elukey: cumin: review some analytics aliases [puppet] - https://gerrit.wikimedia.org/r/670368
[07:24:58] I'll do a graphite1004 reboot shortly to upgrade the kernel, no impact expected
[07:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P14715 and previous config saved to /var/cache/conftool/dbconfig/20210310-072508-marostegui.json
[07:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 for schema change', diff saved to https://phabricator.wikimedia.org/P14716 and previous config saved to /var/cache/conftool/dbconfig/20210310-072642-marostegui.json
[07:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:11] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[07:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:56] (PS1) Marostegui: db2095: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670370
[07:31:42] (CR) Marostegui: [C: +2] db2095: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670370 (owner: Marostegui)
[07:33:36] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
[07:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] (CR) Matěj Suchánek: "> Patch Set 7:" [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[07:46:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14717 and previous config saved to /var/cache/conftool/dbconfig/20210310-074618-root.json
[07:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:39] !log Deploy schema change on s7 codfw (lag will appear) T276150 T276156
[07:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:47] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150
[07:52:48] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156
[07:57:13] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670368 (owner: Elukey)
[08:00:38] (CR) Elukey: [C: +2] cumin: review some analytics aliases [puppet] - https://gerrit.wikimedia.org/r/670368 (owner: Elukey)
[08:01:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14718 and previous config saved to /var/cache/conftool/dbconfig/20210310-080123-root.json
[08:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:18] (PS1) Elukey: cumin: fix typo in alias [puppet] - https://gerrit.wikimedia.org/r/670410
[08:03:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host analytics-tool1001.eqiad.wmnet
[08:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:34] (CR) Elukey: [C: +2] cumin: fix typo in alias [puppet] - https://gerrit.wikimedia.org/r/670410 (owner: Elukey)
[08:05:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics-tool1001.eqiad.wmnet
[08:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host thorium.eqiad.wmnet
[08:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:25] !log Check tables on db1150:3315 - T276742
[08:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:31] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742
[08:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14719 and previous config saved to /var/cache/conftool/dbconfig/20210310-081627-root.json
[08:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:14] !log powercycling thorium, stuck on reboot
[08:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:03] !log pruning obsolete kernels from ganeti hosts in eqiad/codfw
[08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thorium.eqiad.wmnet
[08:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:08] (PS1) Marostegui: db1159: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670412 (https://phabricator.wikimedia.org/T258361)
[08:23:12] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:23:43] (CR) Marostegui: [C: +2] db1159: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/670412 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:25:19] !log Upgrade mysql and kernel on db2078
[08:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:33:02] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:33:57] SRE, User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (MoritzMuehlenhoff)
[08:34:06] ^ expected
[08:39:12] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:39:16] !log Upgrade mysql and kernel on db2132
[08:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:53:54] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:54:01] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (fgiunchedi) >>! In T273778#6898356, @RobH wrote: > I'm having partman issues since this needs to use a new recipe for hw raid 10 flat filesystem. flat.cfg didn't work, so went...
[08:54:37] haproxy ^ expected
[08:54:56] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:58:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:01:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:02:16] (CR) DCausse: [C: +1] add new updater job properties [deployment-charts] - https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) (owner: Mstyles)
[09:04:45] (CR) DCausse: "@Ryan/@Trey: this is just a quick note for next time we change the plugin versions to also update the docker dev image" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/669720 (owner: DCausse)
[09:12:38] (PS9) Kormat: mariadb: Use section params: remaining profiles. [puppet] - https://gerrit.wikimedia.org/r/669845 (https://phabricator.wikimedia.org/T275497)
[09:12:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet
[09:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:28] (PS1) Marostegui: clouddb1021: Enable s5 and s8 [puppet] - https://gerrit.wikimedia.org/r/670417 (https://phabricator.wikimedia.org/T269211)
[09:15:06] (CR) Marostegui: [C: +2] clouddb1021: Enable s5 and s8 [puppet] - https://gerrit.wikimedia.org/r/670417 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[09:17:16] SRE, cloud-services-team (Kanban): cloudvirt2003-dev: debian installer partman recipe prompts for actions - https://phabricator.wikimedia.org/T277014 (aborrero)
[09:18:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet
[09:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2029.codfw.wmnet
[09:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:11] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: REIMAGE
[09:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:18] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: REIMAGE
[09:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2029.codfw.wmnet
[09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:15] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2030.codfw.wmnet
[09:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:33] (CR) Kormat: [C: +1] dbbackups: Move files and templates to dbbackups hierarchy [puppet] - https://gerrit.wikimedia.org/r/670217 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[09:35:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2030.codfw.wmnet
[09:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2031.codfw.wmnet
[09:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:48] SRE, DC-Ops, homer, netops: Remove servers interface names from switches interfaces descriptions - https://phabricator.wikimedia.org/T277006 (Volans) I agree with (1) for now, given that probably we'll go for (3) anyway later on.
[09:41:53] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:46:27] (CR) JMeybohm: [C: -1] builder/docker: break out docker ferm rules into own profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[09:47:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:48:01] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1016 (s5 and s8)" [puppet] - https://gerrit.wikimedia.org/r/670349
[09:48:51] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1016 (s5 and s8)" [puppet] - https://gerrit.wikimedia.org/r/670349 (owner: Marostegui)
[09:49:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:49:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2031.codfw.wmnet
[09:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:20] (CR) Jcrespo: [C: +2] dbbackups: Limit concurrency of es backups on codfw, too [puppet] - https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[09:52:41] (PS3) Jcrespo: dbbackups: Move files and templates to dbbackups hierarchy [puppet] - https://gerrit.wikimedia.org/r/670217 (https://phabricator.wikimedia.org/T138562)
[09:52:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2032.codfw.wmnet
[09:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2032.codfw.wmnet
[09:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2033.codfw.wmnet
[09:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:11] (PS1) Jbond: acme_chief: add gitlab certificate [puppet] - https://gerrit.wikimedia.org/r/670424 (https://phabricator.wikimedia.org/T276673)
[10:00:33] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28479/console" [puppet] - https://gerrit.wikimedia.org/r/670424 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[10:02:43] (CR) Jcrespo: [C: +2] "Thank you for the review!" [puppet] - https://gerrit.wikimedia.org/r/670217 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[10:03:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2033.codfw.wmnet
[10:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:25] (CR) Jcrespo: "Code logic seems ok to me. However, because I have not tested it/checked thoroughly, nor I know too much about the correspondance between " [puppet] - https://gerrit.wikimedia.org/r/670171 (owner: Muehlenhoff)
[10:08:20] (PS1) Jbond: P:gitlab: Deploy acme chief certificate [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673)
[10:09:14] (PS2) Jbond: P:gitlab: Deploy acme chief certificate [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673)
[10:11:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2034.codfw.wmnet
[10:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:44] !log Drop testreduce_vd from m5 master - T276787
[10:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:56] T276787: Drop testreduce and testreduce_vd from m5 master - https://phabricator.wikimedia.org/T276787
[10:17:01] (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:17:18] (CR) Jbond: P:gitlab: Deploy acme chief certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[10:18:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2034.codfw.wmnet
[10:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P14721 and previous config saved to /var/cache/conftool/dbconfig/20210310-101922-marostegui.json
[10:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:02] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (jbond) > If internal IP and behind caching layer, then cergen can be used. Now that the VM...
[10:22:09] (PS1) Klausman: conftool: Add entry for ML k8s masters in eqiad [puppet] - https://gerrit.wikimedia.org/r/670430
[10:23:57] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670171 (owner: Muehlenhoff)
[10:26:32] (CR) Elukey: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670430 (owner: Klausman)
[10:29:24] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1023.eqiad.wmnet
[10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:18] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (MoritzMuehlenhoff) Resolved→Open @OlyKalinichenkoSpeedAndFunction your production SSH key is still in Cloud VPS LDAP, please remove it at...
[10:31:02] SRE, netops, cloud-services-team (Kanban): cloudgw eqiad1: review & allocate subnets and VLANs - https://phabricator.wikimedia.org/T277020 (aborrero) p:Triage→Medium
[10:32:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet
[10:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:26] (CR) Ladsgroup: "I see. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[10:37:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet
[10:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1168.eqiad.wmnet with reason: schema change T267767
[10:38:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1168.eqiad.wmnet with reason: schema change T267767
[10:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:24] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[10:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 depooling: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14722 and previous config saved to /var/cache/conftool/dbconfig/20210310-103836-kormat.json
[10:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:41] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2036.codfw.wmnet
[10:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:01] !log upgrade memcached on mc2019, mc1019
[10:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:42] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14723 and previous config saved to /var/cache/conftool/dbconfig/20210310-104042-kormat.json
[10:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:25] (PS1) Kormat: debian: Ignore integration_env cache dir. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670432
[10:45:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2036.codfw.wmnet
[10:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2037.codfw.wmnet
[10:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14724 and previous config saved to /var/cache/conftool/dbconfig/20210310-104856-root.json
[10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:39] (CR) Jbond: [C: +1] "> Patch Set 3:" (6 comments) [software/netbox] - https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:50:24] (PS9) Jbond: netbox, profile::netbox: Switch to CAS authentication [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:51:36] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28481/console" [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:51:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2037.codfw.wmnet
[10:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:00] (CR) Kormat: [C: +2] debian: Ignore integration_env cache dir. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670432 (owner: Kormat)
[10:53:57] (PS3) Hnowlan: aqs: make aqs1010 a separate AQS cluster [puppet] - https://gerrit.wikimedia.org/r/670197 (https://phabricator.wikimedia.org/T257572)
[10:54:58] (CR) Jbond: [V: +1 C: +1] "LGTM also added pcc" [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:54:59] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1023.eqiad.wmnet
[10:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:31] (Merged) jenkins-bot: debian: Ignore integration_env cache dir. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670432 (owner: Kormat)
[10:55:46] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14725 and previous config saved to /var/cache/conftool/dbconfig/20210310-105545-kormat.json
[10:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:53] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[11:00:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1028.eqiad.wmnet
[11:00:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet
[11:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:35] (CR) Hnowlan: [C: +2] aqs: make aqs1010 a separate AQS cluster [puppet] - https://gerrit.wikimedia.org/r/670197 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[11:04:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14726 and previous config saved to /var/cache/conftool/dbconfig/20210310-110359-root.json
[11:04:02] (CR) Volans: [C: +2] doc: move ClusterShell URL to HTTPS (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/670243 (owner: Volans)
[11:04:04] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1028.eqiad.wmnet
[11:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:50] (PS3) Volans: netbox: refactor unit tests [software/spicerack] - https://gerrit.wikimedia.org/r/670233
[11:06:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet
[11:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2039.codfw.wmnet
[11:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:02] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670430 (owner: Klausman)
[11:08:28] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:09:11] (Merged) jenkins-bot: doc: move ClusterShell URL to HTTPS [software/spicerack] - https://gerrit.wikimedia.org/r/670243 (owner: Volans)
[11:10:06] (CR) Klausman: [C: +2] conftool: Add entry for ML k8s masters in eqiad [puppet] - https://gerrit.wikimedia.org/r/670430 (owner: Klausman)
[11:10:50] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14727 and previous config saved to /var/cache/conftool/dbconfig/20210310-111049-kormat.json
[11:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:57] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[11:11:00] (CR) Volans: [C: +2] netbox: refactor unit tests [software/spicerack] - https://gerrit.wikimedia.org/r/670233 (owner: Volans)
[11:15:47] (Merged) jenkins-bot: netbox: refactor unit tests [software/spicerack] - https://gerrit.wikimedia.org/r/670233 (owner: Volans)
[11:15:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2039.codfw.wmnet
[11:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:20] (PS3) Volans: netbox: fix object type returned for status [software/spicerack] - https://gerrit.wikimedia.org/r/670234
[11:16:58] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet
[11:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14728 and previous config saved to /var/cache/conftool/dbconfig/20210310-111903-root.json
[11:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:30] (CR) Volans: [C: +2] netbox: fix object type returned for status [software/spicerack] - https://gerrit.wikimedia.org/r/670234 (owner: Volans)
[11:22:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet
[11:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P14729 and previous config saved to /var/cache/conftool/dbconfig/20210310-112427-marostegui.json
[11:24:41] (CR) Daimona Eaytoy: [C: +1] "Yes, this is correct." [puppet] - https://gerrit.wikimedia.org/r/670325 (https://phabricator.wikimedia.org/T234615) (owner: Bstorm)
[11:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:53] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14730 and previous config saved to /var/cache/conftool/dbconfig/20210310-112553-kormat.json
[11:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:00] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[11:26:09] (Merged) jenkins-bot: netbox: fix object type returned for status [software/spicerack] - https://gerrit.wikimedia.org/r/670234 (owner: Volans)
[11:27:09] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1024.eqiad.wmnet
[11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:36] (PS1) Jbond: systemd::timer::job: Add ability to redirect stdout/stdin/stderr [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673)
[11:29:02] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1013.eqiad.wmnet
[11:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:28] (PS1) Ladsgroup: Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670350 (https://phabricator.wikimedia.org/T259360)
[11:29:50] (PS1) Ladsgroup: Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/670351 (https://phabricator.wikimedia.org/T259360)
[11:29:58] (CR) Ladsgroup: [C: +2] Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670350 (https://phabricator.wikimedia.org/T259360) (owner: Ladsgroup)
[11:30:02] (CR) Ladsgroup: [C: +2] Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/670351 (https://phabricator.wikimedia.org/T259360) (owner: Ladsgroup)
[11:30:05] (CR) jerkins-bot: [V: -1] systemd::timer::job: Add ability to redirect stdout/stdin/stderr [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673) (owner: Jbond)
[11:30:38] (PS2) Jbond: systemd::timer::job: Add ability to redirect stdout/stdin/stderr [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673)
[11:32:01] (CR) Jgiannelos: Map tiles for 3rd parties: allow consultant to access maps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670229 (https://phabricator.wikimedia.org/T276317) (owner: MSantos)
[11:32:17] (CR) jerkins-bot: [V: -1] systemd::timer::job: Add ability to redirect stdout/stdin/stderr [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673) (owner: Jbond)
[11:33:48] (PS1) Effie Mouzeli: manifests: decomm remove mc1024 from puppet [puppet] - https://gerrit.wikimedia.org/r/670438 (https://phabricator.wikimedia.org/T272074)
[11:34:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1013.eqiad.wmnet
[11:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:57] PROBLEM - cassandra-a CQL 10.64.0.88:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.88 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:35:03] (PS2) Giuseppe Lavagetto: [WiP] Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327)
[11:35:37] aqs1010 is new, Hugh is working on it, all good :)
[11:38:04] (PS1) Klausman: wmnet: Add LVS IPs for ML Team k8s masters [dns] - https://gerrit.wikimedia.org/r/670440
[11:39:41] (CR) Volans: "Thanks for the review, replies inline." (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[11:39:43] PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.120 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:40:33] (PS3) Jbond: systemd::timer::job: Add ability to redirect stdout/stdin/stderr [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673)
[11:41:22] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28484/console" [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673) (owner: Jbond)
[11:42:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
[11:42:34] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host
[11:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:34] (CR) Volans: wmnet: Add LVS IPs for ML Team k8s masters (1 comment) [dns] - https://gerrit.wikimedia.org/r/670440 (owner: Klausman)
[11:46:45] (CR) Klausman: wmnet: Add LVS IPs for ML Team k8s masters (1 comment) [dns] - https://gerrit.wikimedia.org/r/670440 (owner: Klausman)
[11:49:56] (CR) Elukey: [C: +1] "LGTM if Volans is ok :)" [dns] - https://gerrit.wikimedia.org/r/670440 (owner: Klausman)
[11:54:28] (CR) jerkins-bot: [V: -1] Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670350 (https://phabricator.wikimedia.org/T259360) (owner: Ladsgroup)
[11:57:25] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1024.eqiad.wmnet
[11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T1200).
[12:00:04] Amir1: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] o/
[12:00:19] go go go
[12:00:24] (CR) Ladsgroup: [C: +2] "again." [core] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670350 (https://phabricator.wikimedia.org/T259360) (owner: Ladsgroup)
[12:01:08] (CR) Ladsgroup: [C: +2] Update several Wikidata-related configs [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[12:02:49] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/670440 (owner: Klausman)
[12:03:16] (CR) Klausman: [C: +2] wmnet: Add LVS IPs for ML Team k8s masters [dns] - https://gerrit.wikimedia.org/r/670440 (owner: Klausman)
[12:06:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14731 and previous config saved to /var/cache/conftool/dbconfig/20210310-120647-root.json
[12:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:16] !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[12:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet
[12:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:10] (Merged) jenkins-bot: Update several Wikidata-related configs [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 (owner: Matěj Suchánek)
[12:09:35] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:28] Amir1: please ping me once you're done :)
[12:10:39] Surio
[12:10:47] (PS1) Urbanecm: nowiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670352 (https://phabricator.wikimedia.org/T276816)
[12:12:01] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:612918|Update several Wikidata-related configs]] (duration: 01m 32s)
[12:12:06] (CR) jerkins-bot: [V: -1] nowiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670352 (https://phabricator.wikimedia.org/T276816) (owner: Urbanecm)
[12:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:53] (PS2) Urbanecm: nowiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670352 (https://phabricator.wikimedia.org/T276816)
[12:13:07] can someone tell me why jenkins is so slow this week
[12:13:14] and flaky
[12:14:05] (PS1) Klausman: service catalog: Add entry for ML Team k8s control plane [puppet] - https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918)
[12:14:13] You see a new thing everyday
[12:14:15] https://usercontent.irccloud-cdn.com/file/eLlwpGm0/image.png
[12:15:34] (PS3) Urbanecm: nowiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670352 (https://phabricator.wikimedia.org/T276816)
[12:15:54] (Merged) jenkins-bot: Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/670351 (https://phabricator.wikimedia.org/T259360) (owner: Ladsgroup)
[12:15:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet
[12:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:25] Amir1: you mean postmerge, or merging generally?
[12:16:28] (CR) Elukey: [C: +1] service catalog: Add entry for ML Team k8s control plane [puppet] - https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918) (owner: Klausman)
[12:16:37] Urbanecm: all
[12:16:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet
[12:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:48] (PS1) Jbond: P:pli::client: add defaults to cloud hiera [puppet] - https://gerrit.wikimedia.org/r/670446
[12:20:27] (PS1) Klausman: conftool: Add codfw LVS entry for ML Team k8s [puppet] - https://gerrit.wikimedia.org/r/670447
[12:20:46] (CR) Jbond: [C: +2] P:pli::client: add defaults to cloud hiera [puppet] - https://gerrit.wikimedia.org/r/670446 (owner: Jbond)
[12:21:33] (CR) Elukey: [C: +1] conftool: Add codfw LVS entry for ML Team k8s [puppet] - https://gerrit.wikimedia.org/r/670447 (owner: Klausman)
[12:21:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:21:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14732 and previous config saved to /var/cache/conftool/dbconfig/20210310-122150-root.json
[12:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:02] !log ladsgroup@deploy1002 Synchronized php-1.36.0-wmf.33/languages: [[gerrit:670351|Add shy name (same as shy-latn)]] (T259360) (duration: 01m 10s)
[12:22:08] (CR) Klausman: [C: +2] conftool: Add codfw LVS entry for ML Team k8s [puppet] - https://gerrit.wikimedia.org/r/670447 (owner: Klausman)
[12:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:10] T259360: Cognate doesn't properly create interwiki links for Shawiya Wiktionary (shy.wiktionary.org) - https://phabricator.wikimedia.org/T259360
[12:22:41] (PS2) Klausman: service catalog: Add entry for ML Team k8s control plane [puppet] - https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918)
[12:22:43] (PS1) Jbond: hiera - cloud project: enable pki client [puppet] - https://gerrit.wikimedia.org/r/670450
[12:23:06] (PS1) Ladsgroup: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/670451
[12:23:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet
[12:23:08] (CR) Ladsgroup: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/670451 (owner: Ladsgroup)
[12:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:54] (CR) Ladsgroup: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/670451 (owner: Ladsgroup)
[12:23:58] (CR) Jbond: [C: +2] hiera - cloud project: enable pki client [puppet] - https://gerrit.wikimedia.org/r/670450 (owner: Jbond)
[12:24:01] (Abandoned) Ladsgroup: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/670451 (owner: Ladsgroup)
[12:24:30] klausman: is it ok to merge your change
[12:24:35] yes
[12:24:51] thx merged
[12:24:52] (PS1) WMDE-Fisch: Enable CodeMirror accessibility color schema on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/670452 (https://phabricator.wikimedia.org/T271895)
[12:28:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet
[12:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:30] Amir1: still deploying?
[12:29:47] yup, let me see how long it's left for the 1.34 to merge
[12:30:01] zuul says 2 minutes
[12:30:09] okay
[12:30:22] * Urbanecm has a config change, btw
[12:31:28] !log ariel@cumin1001 START - Cookbook sre.cassandra.roll-restart
[12:31:28] !log ariel@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99)
[12:31:31] is anyone aware of cirrussearch issue with rest api?
[12:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:49] (PS3) Klausman: service catalog: Add entry for ML Team k8s control plane [puppet] - https://gerrit.wikimedia.org/r/670444 (https://phabricator.wikimedia.org/T272918)
[12:32:04] spamming mediawiki-errors like crazy
[12:32:09] !log ariel@cumin1001 START - Cookbook sre.cassandra.roll-restart
[12:32:09] (Merged) jenkins-bot: Add shy name (same as shy-latn) [core] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670350 (https://phabricator.wikimedia.org/T259360) (owner: Ladsgroup)
[12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:15] \o
[12:32:18] \o/
[12:32:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet
[12:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet
[12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:06] dcausse: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.03.10?id=qEUYHHgBfVMx58vqFTaU FYI
[12:34:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet
[12:34:22] !log ladsgroup@deploy1002 Synchronized php-1.36.0-wmf.34/languages: [[gerrit:670350|Add shy name (same as shy-latn)]] (T259360) (duration: 01m 10s)
[12:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:31] (PS1) Kormat: dbutil: Use fqdn if host doesn't match a dc shortname. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670454
[12:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:32] T259360: Cognate doesn't properly create interwiki links for Shawiya Wiktionary (shy.wiktionary.org) - https://phabricator.wikimedia.org/T259360
[12:35:09] Urbanecm: the floor is yours
[12:35:13] RECOVERY - cassandra-a SSL 10.192.16.98:7001 on restbase2019 is OK: SSL OK - Certificate restbase2019-a valid until 2023-03-10 12:12:02 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:35:16] thanks
[12:35:37] (CR) Urbanecm: [C: +2] nowiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670352 (https://phabricator.wikimedia.org/T276816) (owner: Urbanecm)
[12:35:49] RECOVERY - cassandra-b SSL 10.192.16.99:7001 on restbase2019 is OK: SSL OK - Certificate restbase2019-b valid until 2023-03-10 12:12:05 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:36:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet
[12:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14733 and previous config saved to /var/cache/conftool/dbconfig/20210310-123654-root.json
[12:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:14] (Merged) jenkins-bot: nowiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670352 (https://phabricator.wikimedia.org/T276816) (owner: Urbanecm)
[12:37:15] (PS1) Kormat: tox: Upgrade mypy to a version that has --exclude [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670455
[12:37:51] RECOVERY - cassandra-c SSL 10.192.16.100:7001 on restbase2019 is OK: SSL OK - Certificate restbase2019-c valid until 2023-03-10 12:12:07 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:40:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet
[12:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet
[12:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:11] (CR) Kormat: [C: +2] dbutil: Use fqdn if host doesn't match a dc shortname. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670454 (owner: Kormat)
[12:41:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170:3317 for schema change', diff saved to https://phabricator.wikimedia.org/P14734 and previous config saved to /var/cache/conftool/dbconfig/20210310-124140-marostegui.json
[12:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:44] (CR) Kormat: [C: +2] tox: Upgrade mypy to a version that has --exclude [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670455 (owner: Kormat)
[12:43:36] (Merged) jenkins-bot: dbutil: Use fqdn if host doesn't match a dc shortname. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670454 (owner: Kormat)
[12:43:39] RECOVERY - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is OK: SSL OK - Certificate restbase2020-a valid until 2023-03-10 12:12:10 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:44:15] (PS1) Urbanecm: nowiki: add missing wgGEHelpPanelViewMoreTitle link [mediawiki-config] - https://gerrit.wikimedia.org/r/670457
[12:44:17] (CR) Urbanecm: [C: +2] nowiki: add missing wgGEHelpPanelViewMoreTitle link [mediawiki-config] - https://gerrit.wikimedia.org/r/670457 (owner: Urbanecm)
[12:45:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 623ed48472e63c8f1c5965289163d7ef80ab4412: nowiki: Enable Growth features in stealth mode (T276816) (duration: 01m 07s)
[12:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:02] T276816: Deploy Growth features on Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T276816
[12:46:19] (Merged) jenkins-bot: nowiki: add missing wgGEHelpPanelViewMoreTitle link [mediawiki-config] - https://gerrit.wikimedia.org/r/670457 (owner: Urbanecm)
[12:46:46] (Merged) jenkins-bot: tox: Upgrade mypy to a version that has --exclude [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/670455 (owner: Kormat)
[12:46:59] RECOVERY - cassandra-b SSL 10.192.32.120:7001 on restbase2020 is OK: SSL OK - Certificate restbase2020-b valid until 2023-03-10 12:12:13 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:47:13] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1029.eqiad.wmnet
[12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet
[12:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet
[12:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:37] RECOVERY - cassandra-c SSL 10.192.32.121:7001 on restbase2020 is OK: SSL OK - Certificate restbase2020-c valid until 2023-03-10 12:12:16 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662
[12:49:24] * Urbanecm done
[12:52:06] !log ariel@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[12:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet
[12:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet
[12:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:52] * CFisch_WMDE merging a labs only patch - stay tuned.
[12:59:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet
[12:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:53] (PS1) Jbond: cloud pki: add deployment-prep agents as authorised clients [puppet] - https://gerrit.wikimedia.org/r/670462
[13:03:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet
[13:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:41] PROBLEM - ensure kvm processes are running on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:07:53] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1029.eqiad.wmnet
[13:07:57] (PS1) JMeybohm: chromium-render: Add default labels and fix name of configmap [deployment-charts] - https://gerrit.wikimedia.org/r/670464
[13:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet
[13:10:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet
[13:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:33] (CR) Muehlenhoff: [C: +1] manifests: decomm remove mc1024 from puppet [puppet] - https://gerrit.wikimedia.org/r/670438 (https://phabricator.wikimedia.org/T272074) (owner: Effie Mouzeli)
[13:12:02] (PS3) Muehlenhoff: Temporarily switch to deb.debian.org for sodium reboot [puppet] - https://gerrit.wikimedia.org/r/670211
[13:14:27] RECOVERY - ensure kvm processes are running on cloudvirt1029 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:17:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet
[13:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:25] SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Ladsgroup)
[13:17:29] SRE, Wikimedia-Mailing-lists, Upstream: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (Ladsgroup)
[13:18:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[13:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:40] (CR) Effie Mouzeli: [C: +2] manifests: decomm remove mc1024 from puppet [puppet] - https://gerrit.wikimedia.org/r/670438 (https://phabricator.wikimedia.org/T272074) (owner: Effie Mouzeli)
[13:24:02] (CR) Muehlenhoff: [C: +2] Simplify microcode check [puppet] - https://gerrit.wikimedia.org/r/670171 (owner: Muehlenhoff)
[13:24:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[13:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[13:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
[13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:18] (CR) Jbond: [C: +2] cloud pki: add deployment-prep agents as authorised clients [puppet] - https://gerrit.wikimedia.org/r/670462 (owner: Jbond)
[13:31:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
[13:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
[13:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
[13:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
[13:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:34] (CR) Awight: [C: +1] Enable CodeMirror accessibility color schema on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/670452 (https://phabricator.wikimedia.org/T271895) (owner: WMDE-Fisch)
[13:48:23] * CFisch_WMDE merging that labs patch now
[13:48:49] (CR) WMDE-Fisch: [C: +2] Enable CodeMirror accessibility color schema on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/670452 (https://phabricator.wikimedia.org/T271895) (owner: WMDE-Fisch)
[13:49:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
[13:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:42] (Merged) jenkins-bot: Enable CodeMirror accessibility color schema on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/670452 (https://phabricator.wikimedia.org/T271895) (owner: WMDE-Fisch)
[13:53:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14736 and previous config saved to /var/cache/conftool/dbconfig/20210310-135309-root.json
[13:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
[13:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet
[13:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:05] brennen and liw: Time to snap out of that daydream and deploy Mediawiki train - American+European Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T1400).
[14:01:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet
[14:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
[14:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:10] SRE, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (Mvolz)
[14:06:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
[14:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[14:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14738 and previous config saved to /var/cache/conftool/dbconfig/20210310-140812-root.json
[14:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:07] (CR) DannyS712: [C: +1] wikireplicas: add new columns for abuse_filter_log to wikireplicas [puppet] - https://gerrit.wikimedia.org/r/670325 (https://phabricator.wikimedia.org/T234615) (owner: Bstorm)
[14:14:37] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
[14:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
[14:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:48] (CR) CDanis: [C: +2] Map tiles for 3rd parties: allow consultant to access maps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670229 (https://phabricator.wikimedia.org/T276317) (owner: MSantos)
[14:19:17] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[14:19:17] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[14:19:17] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[14:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14739 and previous config saved to /var/cache/conftool/dbconfig/20210310-142316-root.json
[14:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:24] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673) (owner: Jbond)
[14:24:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
[14:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet
[14:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:41] (CR) Jbond: [V: +1 C: +2] systemd::timer::job: Add ability to redirect stdout/stdin/stderr [puppet] - https://gerrit.wikimedia.org/r/670436 (https://phabricator.wikimedia.org/T273673) (owner: Jbond)
[14:34:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet
[14:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:39] (CR) Muehlenhoff: [C: +2] Temporarily switch to deb.debian.org for sodium reboot [puppet] - https://gerrit.wikimedia.org/r/670211 (owner: Muehlenhoff)
[14:35:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet
[14:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P14740 and previous config saved to /var/cache/conftool/dbconfig/20210310-143547-marostegui.json
[14:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:03] (CR) Jbond: [V: +2 C: +2] CAS style changes [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/661734 (owner: Jbond)
[14:39:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:42:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:43:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet
[14:43:51] (PS4) Jbond: 6.4.0-RC2: test to see if issue is still present [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/661721 (https://phabricator.wikimedia.org/T273867)
[14:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet
[14:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14741 and previous config saved to /var/cache/conftool/dbconfig/20210310-144813-root.json
[14:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:26] (CR) ZPapierski: [C: +1] add new updater job properties [deployment-charts] - https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) (owner: Mstyles)
[14:52:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet
[14:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:35] !log klausman@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: cluster=ml_serve,service=kubemaster
[14:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:07] (CR) Gehel: "See minor comment inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[14:59:54] (PS1) Jbond: 6.3.2: create 6.3.2 release [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/670483
[15:03:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 30%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14742 and previous config saved to /var/cache/conftool/dbconfig/20210310-150316-root.json
[15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:41] (PS1) Gerrit Patch Uploader: https://phabricator.wikimedia.org/T266933 [mediawiki-config] - https://gerrit.wikimedia.org/r/670484
[15:14:43] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/670484 (owner: Gerrit Patch Uploader)
[15:16:22] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sodium.wikimedia.org
[15:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:20] (PS1) RobH: fixing kafkalogging partman [puppet] - https://gerrit.wikimedia.org/r/670485 (https://phabricator.wikimedia.org/T273778)
[15:18:05] (CR) RobH: [C: +2] fixing kafkalogging partman [puppet] - https://gerrit.wikimedia.org/r/670485 (https://phabricator.wikimedia.org/T273778) (owner: RobH)
[15:18:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14743 and previous config saved to /var/cache/conftool/dbconfig/20210310-151820-root.json
[15:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sodium.wikimedia.org
[15:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:25] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (fgiunchedi)
[15:24:32] (PS1) Muehlenhoff: Revert "Temporarily switch to deb.debian.org for sodium reboot" [puppet] - https://gerrit.wikimedia.org/r/670486
[15:25:17] (PS1) RobH: kafka-logging should be buster [puppet] - https://gerrit.wikimedia.org/r/670487 (https://phabricator.wikimedia.org/T273778)
[15:26:04] (CR) RobH: [C: +2] kafka-logging should be buster [puppet] - https://gerrit.wikimedia.org/r/670487 (https://phabricator.wikimedia.org/T273778) (owner: RobH)
[15:27:03] (PS3) Waihorace: Enable RelatedArticles Extension in zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933)
[15:27:59] ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (RobH)
[15:31:38] (CR) SBassett: [C: +1] "This sounds low-risk from a security/privacy perspective, given T234615#6899973." [puppet] - https://gerrit.wikimedia.org/r/670325 (https://phabricator.wikimedia.org/T234615) (owner: Bstorm)
[15:32:15] SRE, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (jbond) I have started to look at this again and wanted to look at the difference between the KRYO vs other encodings and noticed that when i try to use the following python script ` lang=pytho...
[15:32:51] (CR) Zoranzoki21: [C: -1] "Duplicate of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/670359/" [mediawiki-config] - https://gerrit.wikimedia.org/r/670484 (owner: Gerrit Patch Uploader)
[15:33:13] (CR) Zoranzoki21: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/670359 (https://phabricator.wikimedia.org/T266933) (owner: Waihorace)
[15:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Repool db1127 after schema change', diff saved to https://phabricator.wikimedia.org/P14744 and previous config saved to /var/cache/conftool/dbconfig/20210310-153324-root.json
[15:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:29] (PS4) Tjones: Add a note for the elasticsearch image in releng/dev-images [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/669720 (owner: DCausse)
[15:40:31] (CR) Tjones: [V: +2 C: +2] "Looks good!" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/669720 (owner: DCausse)
[15:43:49] (CR) Muehlenhoff: [C: +2] Revert "Temporarily switch to deb.debian.org for sodium reboot" [puppet] - https://gerrit.wikimedia.org/r/670486 (owner: Muehlenhoff)
[15:46:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:49:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:03:14] (CR) Volans: "Did a first pass, some random comments inline." (7 comments) [alerts] - https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[16:04:07] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['kafka-logging1001.eqiad.wmnet', 'kafka-logging100...
[16:04:23] SRE, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (jbond) I have just built 6.3.2 and python can at least read the value without the above stack trace however cas still has trouble retrieving the value from memcache although now with a slightly...
[16:07:33] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/668231 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[16:17:20] (Abandoned) Urbanecm: https://phabricator.wikimedia.org/T266933 [mediawiki-config] - https://gerrit.wikimedia.org/r/670484 (owner: Gerrit Patch Uploader)
[16:18:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE
[16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:54] (PS3) Majavah: betacluster: add db[07-08], promote db06, remove db05 [mediawiki-config] - https://gerrit.wikimedia.org/r/670273 (https://phabricator.wikimedia.org/T276968)
[16:20:09] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE
[16:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:34] SRE, Analytics, observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (Ottomata) Basically, a non-aggregate Kafka cluster (like Kafka jumbo) is the source of stream data. Here, a 'stream' refers to multiple topics, in our case, every DC...
[16:24:52] SRE, Data-Persistence-Backup, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo)
[16:25:33] SRE, Data-Persistence-Backup, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo)
[16:27:20] (PS1) Bartosz Dziewoński: Allow users to continue using reply tool after disabling A/B test [extensions/DiscussionTools] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/670362 (https://phabricator.wikimedia.org/T276967)
[16:27:32] (PS1) Bartosz Dziewoński: Allow users to continue using reply tool after disabling A/B test [extensions/DiscussionTools] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670363 (https://phabricator.wikimedia.org/T276967)
[16:31:25] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` kafka-logging1001.eqiad.wmnet ` The log can be foun...
[16:34:58] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` kafka-logging2001.codfw.wmnet ` The log can be found in `/v...
[16:40:24] (03PS1) 10Jason Linehan: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 [16:45:17] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE [16:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:32] (03PS2) 10Jason Linehan: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 [16:47:14] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1001.eqiad.wmnet with reason: REIMAGE [16:47:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:52] (03CR) 10Mholloway: [C: 03+1] Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (owner: 10Jason Linehan) [16:48:58] mholloway: you only have to copy the individual stream config block [16:48:58] (03CR) 10Bstorm: [C: 03+2] "Merging it!" [puppet] - 10https://gerrit.wikimedia.org/r/670325 (https://phabricator.wikimedia.org/T234615) (owner: 10Bstorm) [16:49:02] not all of wgEventStreams [16:49:03] so [16:49:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:49:20] testwiki => [ [ 'mediawiki.client.session_tick' => [ ... 
] ] [16:49:20] part [16:49:29] (03PS3) 10Mholloway: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [16:50:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: REIMAGE [16:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:35] (03CR) 10Ottomata: Enable session tick on testwiki with 100% sampling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [16:50:43] (03CR) 10Ottomata: [C: 04-1] Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [16:50:49] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1030.eqiad.wmnet [16:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:16] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: REIMAGE [16:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:16] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:54:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-logging1001.eqiad.wmnet'] ` and were **ALL** successful. 
[16:54:38] (03CR) 10Urbanecm: [C: 03+2] betacluster: add db[07-08], promote db06, remove db05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670273 (https://phabricator.wikimedia.org/T276968) (owner: 10Majavah) [16:55:23] (03PS3) 10Ahmon Dancy: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/670258 [16:55:30] (03PS4) 10Mholloway: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [16:55:57] (03Merged) 10jenkins-bot: betacluster: add db[07-08], promote db06, remove db05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670273 (https://phabricator.wikimedia.org/T276968) (owner: 10Majavah) [16:56:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1030.eqiad.wmnet [16:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:54] (03PS4) 10Ahmon Dancy: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/670258 [16:58:49] (03PS1) 10Alexandros Kosiaris: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/670504 (https://phabricator.wikimedia.org/T274262) [16:59:37] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-logging2001.codfw.wmnet'] ` and were **ALL** successful. 
[17:00:37] (03PS5) 10Mholloway: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [17:01:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) [17:02:26] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` kafka-logging2002.codfw.wmnet ` The log can be found in `/v... [17:03:24] (03PS1) 10Urbanecm: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670364 (https://phabricator.wikimedia.org/T276968) [17:03:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670364 (https://phabricator.wikimedia.org/T276968) (owner: 10Urbanecm) [17:04:25] 10SRE, 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10Papaul) [17:04:53] 10SRE, 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10Papaul) 05Open→03Resolved complete [17:05:29] (03PS2) 10Urbanecm: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670364 (https://phabricator.wikimedia.org/T276968) [17:06:08] (03CR) 10Urbanecm: [C: 03+2] Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670364 (https://phabricator.wikimedia.org/T276968) (owner: 10Urbanecm) [17:06:25] (03PS1) 10Mforns: Set mediawiki_client_session_tick sampling rate to 100% on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670509 [17:07:25] (03Merged) 
10jenkins-bot: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670364 (https://phabricator.wikimedia.org/T276968) (owner: 10Urbanecm) [17:12:50] (03PS1) 10Phuedx: searchSatisfaction: Allow for async initialisation [extensions/WikimediaEvents] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670365 (https://phabricator.wikimedia.org/T274869) [17:13:11] (03PS1) 10Phuedx: searchSatisfaction: Allow for async initialisation [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670526 (https://phabricator.wikimedia.org/T274869) [17:15:11] (03PS1) 10Cwhite: profile,prometheus: add enable_indices_stats flag [puppet] - 10https://gerrit.wikimedia.org/r/670514 [17:17:17] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: REIMAGE [17:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:20] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kafka-logging2002.codfw.wmnet with reason: REIMAGE [17:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:08] (03CR) 10Dzahn: "I don't know which webserver they are planning to use or if they even want the cert from us. So far it sounded like they have "built-in" L" [puppet] - 10https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [17:23:53] mutante: re ^^ there is more conversation on the linked task [17:26:39] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-logging2002.codfw.wmnet'] ` and were **ALL** successful.
[17:26:46] (03PS6) 10Ottomata: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [17:27:09] jbond42: ACK, if that is preferred then cool [17:28:03] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` kafka-logging2003.codfw.wmnet ` The log can be found in `/v... [17:28:59] (03CR) 10Mstyles: [C: 03+2] add new updater job properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) (owner: 10Mstyles) [17:29:39] 10SRE, 10Packaging, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (10Legoktm) You want to be running Debian Buster/testing/unstable, because that has all the tooling available (you can also do this in a container). `apt-get ins... 
[17:29:47] (03Merged) 10jenkins-bot: add new updater job properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) (owner: 10Mstyles) [17:30:27] (03CR) 10Mholloway: [C: 03+1] Enable session tick on testwiki with 100% sampling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [17:31:27] (03PS1) 10Razzi: Update comment to reflect that an-conf100[1-3] are used [puppet] - 10https://gerrit.wikimedia.org/r/670519 [17:35:19] 10SRE, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) 05Stalled→03Open [17:40:03] (03PS1) 10Gerrit maintenance bot: Add trv to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/670521 (https://phabricator.wikimedia.org/T276246) [17:42:10] (03CR) 10Wolfgang Kandek: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [17:42:55] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: REIMAGE [17:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:46] (03CR) 10Dzahn: [C: 03+2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Kari_Seediq" [dns] - 10https://gerrit.wikimedia.org/r/670521 (https://phabricator.wikimedia.org/T276246) (owner: 10Gerrit maintenance bot) [17:45:07] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kafka-logging2003.codfw.wmnet with reason: REIMAGE [17:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:02] (03PS1) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:46:19] (03PS1) 10Cwhite: 
logstash: extract index label from logEvent indexing errors [puppet] - 10https://gerrit.wikimedia.org/r/670525 (https://phabricator.wikimedia.org/T234565) [17:46:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:46:58] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:47] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:48:50] !log new Wikimedia project language "trv" added - Seediq is an Atayalic language spoken in the mountains of Northern Taiwan by the Seediq and Taroko people. [17:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:17] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-logging2003.codfw.wmnet'] ` and were **ALL** successful. 
[17:53:13] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10Papaul) [17:53:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:53:52] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging200[123].codfw.wmnet - https://phabricator.wikimedia.org/T274905 (10Papaul) 05Open→03Resolved @herron all yours [17:54:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:55:16] (03PS2) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [17:56:30] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [17:56:55] (03CR) 10Ahmon Dancy: [C: 04-1] profile::ci::slave::labs::common: move to cinder-based storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [18:02:32] 10SRE: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Majavah) [18:04:41] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/670522 (owner: 10CRusnov) [18:11:47] (03CR) 10Legoktm: [C: 04-1] mailman3: Add exim4 configuration [puppet] - 10https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup) [18:12:33] (03PS1) 10Dwisehaupt: Move frdb primary to frdb1004 and remove decom'd frqueue hosts [dns] - 10https://gerrit.wikimedia.org/r/670549 (https://phabricator.wikimedia.org/T268056) [18:14:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:22] (03PS1) 10Jgreen: nsca_frack.cfg.erb: swap frdb1002/frdb1004, remove frqueue1001 [puppet] - 10https://gerrit.wikimedia.org/r/670550 (https://phabricator.wikimedia.org/T268056) [18:17:33] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb: swap frdb1002/frdb1004, remove frqueue1001 [puppet] - 10https://gerrit.wikimedia.org/r/670550 (https://phabricator.wikimedia.org/T268056) (owner: 10Jgreen) [18:18:19] (03CR) 10Jgreen: [C: 03+2] Move frdb primary to frdb1004 and remove decom'd frqueue hosts [dns] - 10https://gerrit.wikimedia.org/r/670549 (https://phabricator.wikimedia.org/T268056) (owner: 10Dwisehaupt) [18:18:27] !log mforns@deploy1002 Started deploy [analytics/refinery@7fbc3c7]: Regular analytics weekly train [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] [18:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:43] (03PS7) 10Ottomata: Enable session tick on testwiki with 100% sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [18:22:22] (03PS3) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [18:22:24] (03PS1) 10Andrew Bogott: cinderutils::ensure: Add a 'mount' resource 
[puppet] - 10https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511) [18:22:26] (03CR) 10Ottomata: Enable session tick on testwiki with 100% sampling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [18:24:10] (03CR) 10jerkins-bot: [V: 04-1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott) [18:28:44] (03Abandoned) 10Mforns: Set mediawiki_client_session_tick sampling rate to 100% on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670509 (owner: 10Mforns) [18:32:57] !log mforns@deploy1002 Finished deploy [analytics/refinery@7fbc3c7]: Regular analytics weekly train [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] (duration: 14m 30s) [18:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:05] !log mforns@deploy1002 Started deploy [analytics/refinery@7fbc3c7] (thin): Regular analytics weekly train THIN [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] [18:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:12] !log mforns@deploy1002 Finished deploy [analytics/refinery@7fbc3c7] (thin): Regular analytics weekly train THIN [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] (duration: 00m 07s) [18:33:22] !log mforns@deploy1002 Started deploy [analytics/refinery@7fbc3c7] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] [18:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:41] (03PS2) 10Razzi: Update comment to reflect that an-conf100[1-3] are in use [puppet] - 10https://gerrit.wikimedia.org/r/670519 [18:37:34] !log 
mforns@deploy1002 Finished deploy [analytics/refinery@7fbc3c7] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7fbc3c700ccb3c598690da9a38990ef7cb187656] (duration: 04m 12s) [18:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:12] (03CR) 10Elukey: [C: 03+1] Update comment to reflect that an-conf100[1-3] are in use [puppet] - 10https://gerrit.wikimedia.org/r/670519 (owner: 10Razzi) [18:40:22] (03CR) 10Razzi: [C: 03+2] Update comment to reflect that an-conf100[1-3] are in use [puppet] - 10https://gerrit.wikimedia.org/r/670519 (owner: 10Razzi) [18:42:54] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [18:45:03] (03CR) 10Ryan Kemper: wdqs: impl. envoy for wdqs-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670339 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [18:45:11] 10SRE: Improve alerting for hosts with Puppet disabled for longer periods - https://phabricator.wikimedia.org/T277083 (10Volans) p:05Triage→03Medium [18:50:37] (03PS1) 10Razzi: Enable maintenance mode for matomo reboot [puppet] - 10https://gerrit.wikimedia.org/r/670559 (https://phabricator.wikimedia.org/T273278) [18:50:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/670522 (owner: 10CRusnov) [18:52:02] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Update to v2.10.4-wmf2 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/670522 (owner: 10CRusnov) [18:53:28] (03CR) 10Dduvall: [C: 03+1] releases: include profile::docker::ferm in releases role [puppet] - 10https://gerrit.wikimedia.org/r/670289 (https://phabricator.wikimedia.org/T276869) (owner: 10Dzahn) [18:53:35] (03CR) 10Mholloway: Enable session tick on testwiki with 100% sampling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason 
Linehan) [18:53:37] (03CR) 10Ottomata: "This is not working because somehow the merging is doing weird stuff for the numerically indexed array. Reading the code in StaticSiteCon" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: 10Jason Linehan) [18:55:38] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10Gilles) @JKatzWMF can you make that happen? [19:00:04] brennen and liw: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T1900). [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T1900). [19:00:04] MatmaRex, stephanebisson, and Jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
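Ottomata's 18:53:37 review comment on change 670496 says the testwiki override was not taking effect because the merging was "doing weird stuff for the numerically indexed array". The log truncates the rest of that explanation. The Python sketch below illustrates one generic way such a collision can happen. It assumes, without confirmation from the log, that per-wiki and default values are combined with PHP array-union (`+`) semantics, where the left-hand array wins on every key both arrays share. The stream blocks, the `sample_rate` field, and both helper functions are hypothetical stand-ins, not the real wgEventStreams schema or SiteConfiguration code.

```python
# Hypothetical illustration only: this is not MediaWiki's SiteConfiguration code,
# and the stream blocks below are simplified stand-ins for wgEventStreams entries.


def php_array_union(left: dict, right: dict) -> dict:
    """Mimic PHP's `$left + $right`: keys already in `left` win,
    and only keys missing from `left` are copied over from `right`."""
    merged = dict(left)
    for key, value in right.items():
        if key not in merged:
            merged[key] = value
    return merged


def merge_by_stream_name(defaults: list, overrides: list) -> list:
    """Alternative: key each block by its stream name so an override
    replaces the matching default instead of colliding by list position."""
    by_name = {block["stream"]: block for block in defaults}
    for block in overrides:
        by_name[block["stream"]] = block
    return list(by_name.values())


# Hypothetical defaults: numerically indexed, like a PHP list.
default_streams = {
    0: {"stream": "some.other.stream"},
    1: {"stream": "mediawiki.client.session_tick", "sample_rate": 0.1},
}

# Hypothetical testwiki override: one copied block, which lands at index 0.
testwiki_override = {
    0: {"stream": "mediawiki.client.session_tick", "sample_rate": 1.0},
}

union = php_array_union(testwiki_override, default_streams)
for index, block in sorted(union.items()):
    print(index, block)
# Index 0 now holds the override, "some.other.stream" is gone, and the
# old session_tick block (rate 0.1) survives at index 1.

by_name = merge_by_stream_name(
    list(default_streams.values()), list(testwiki_override.values())
)
print(by_name)
```

Under this assumption, list position rather than stream name decides which block wins, which would make an override look ignored or duplicated. Keying blocks by stream name, as in the second helper, avoids positional collisions; how change 670496 was finally reworked is not shown in this part of the log.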
[19:00:12] o/ [19:00:15] hello [19:00:18] i can deploy today [19:00:20] hi everyone [19:00:38] hi [19:00:52] my backports need to go before my configs [19:00:53] (03CR) 10Urbanecm: [C: 03+2] searchSatisfaction: Allow for async initialisation [extensions/WikimediaEvents] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670365 (https://phabricator.wikimedia.org/T274869) (owner: 10Phuedx) [19:00:56] (03CR) 10Urbanecm: [C: 03+2] searchSatisfaction: Allow for async initialisation [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670526 (https://phabricator.wikimedia.org/T274869) (owner: 10Phuedx) [19:01:03] MatmaRex: acknowledged, thanks [19:01:10] (03CR) 10Urbanecm: [C: 03+2] Allow users to continue using reply tool after disabling A/B test [extensions/DiscussionTools] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670363 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:01:12] (03CR) 10Urbanecm: [C: 03+2] Allow users to continue using reply tool after disabling A/B test [extensions/DiscussionTools] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670362 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:01:55] (03PS2) 10Urbanecm: Remove unused config for InukaPageView [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667219 (https://phabricator.wikimedia.org/T265921) (owner: 10Sbisson) [19:01:59] (03CR) 10Urbanecm: [C: 03+2] Remove unused config for InukaPageView [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667219 (https://phabricator.wikimedia.org/T265921) (owner: 10Sbisson) [19:02:00] Urbanecm: mine is a no-op config cleanup. [19:02:29] stephanebisson: yup, thanks. Do you wish to test it before syncing? 
I think it's fine to just sync, as i don't see that var anywhere in codesearch [19:02:34] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` kafka-logging1002.eqiad.wmnet ` The log can be found in `/var/log/wmf-aut... [19:02:55] Urbanecm: I did a code search as well, nothing to test [19:03:05] okay, excellent. I'll sync it once it merges. [19:03:11] (03Merged) 10jenkins-bot: Remove unused config for InukaPageView [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667219 (https://phabricator.wikimedia.org/T265921) (owner: 10Sbisson) [19:06:22] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fe99c312b3ce635342cbd690c34e2610184b74b0: Remove unused config for InukaPageView (T265921) (duration: 01m 26s) [19:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:30] T265921: Stop collecting InukaPageView data from web clients - https://phabricator.wikimedia.org/T265921 [19:06:57] stephanebisson: done. Anything else from you? [19:07:04] Urbanecm: nope, thank you [19:07:09] np :) [19:07:15] (03Merged) 10jenkins-bot: searchSatisfaction: Allow for async initialisation [extensions/WikimediaEvents] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670365 (https://phabricator.wikimedia.org/T274869) (owner: 10Phuedx) [19:07:18] (03Merged) 10jenkins-bot: searchSatisfaction: Allow for async initialisation [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670526 (https://phabricator.wikimedia.org/T274869) (owner: 10Phuedx) [19:08:42] Jdlrobson: both are pulled on mwdebug1001, please test. 
[19:09:25] (03Merged) 10jenkins-bot: Allow users to continue using reply tool after disabling A/B test [extensions/DiscussionTools] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670363 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:09:28] (03Merged) 10jenkins-bot: Allow users to continue using reply tool after disabling A/B test [extensions/DiscussionTools] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670362 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:10:30] MatmaRex: your backports are on mwdebug1002, please test [19:11:03] Urbanecm: they don't do anything without the config patch about A/B test [19:11:27] MatmaRex: aha. I'll pull the config patches there as well then :) [19:11:27] we don't have any wikis right now where we disabled it [19:11:33] yeah. thanks [19:11:40] MatmaRex: do you need all of them? or just some? [19:11:53] (03PS2) 10Urbanecm: Disable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670294 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:12:03] just that one for this, i guess [19:12:07] if it rebases cleanly [19:12:21] (03CR) 10Urbanecm: [C: 03+2] Disable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670294 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:12:21] oh, never mind, i put it first. yeah [19:12:37] okay, I'll ping you once it's ready. [19:12:49] Jdlrobson: did you see my message? [19:13:13] (03Merged) 10jenkins-bot: Disable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670294 (https://phabricator.wikimedia.org/T276967) (owner: 10Bartosz Dziewoński) [19:13:38] Urbanecm: did not .. [19:13:41] checking now [19:13:44] sorry! 
:) [19:13:48] np :) [19:14:18] !log T266470 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation [19:14:19] MatmaRex: pulled the patch onto mwdebug1002 as well, please test. [19:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:25] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28485/console" [puppet] - 10https://gerrit.wikimedia.org/r/670559 (https://phabricator.wikimedia.org/T273278) (owner: 10Razzi) [19:14:25] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470 [19:14:27] !log T266470 on `ryankemper@cumin1001`: `sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'` [19:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:58] Urbanecm: looks good! you can sync those changes! [19:15:03] syncing, thanks! 
[19:15:15] Urbanecm: looks good
[19:15:30] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: REIMAGE
[19:15:30] (testing as "Matma Rex test 2021-02-03" on ptwiki, who is in the A/B test)
[19:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:43] !log T266470 `sudo puppet cert clean wdqs.discovery.wmnet`
[19:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:51] thanks MatmaRex, will sync
[19:16:07] (PS3) Urbanecm: Enable DiscussionTools' beta feature for newtopictool on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669893 (https://phabricator.wikimedia.org/T275827) (owner: Bartosz Dziewoński)
[19:16:11] (CR) Urbanecm: [C: +2] Enable DiscussionTools' beta feature for newtopictool on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669893 (https://phabricator.wikimedia.org/T275827) (owner: Bartosz Dziewoński)
[19:16:45] !log T266470 `sudo rm -fv certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 certificates/wdqs.discovery.wmnet/truststore.jks` (full paths not provided to fit the IRC line)
[19:16:51] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: d9bad12cdb02e13517cecd1775162fde88af48eb: searchSatisfaction: Allow for async initialisation (T274869) (duration: 01m 08s)
[19:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:00] T274869: Instrumentation QA for new search widget - https://phabricator.wikimedia.org/T274869
[19:17:37] (Merged) jenkins-bot: Enable DiscussionTools' beta feature for newtopictool on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669893 (https://phabricator.wikimedia.org/T275827) (owner: Bartosz Dziewoński)
[19:17:41] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1002.eqiad.wmnet with reason: REIMAGE
[19:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:18:17] !log T266470 `sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d`
[19:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:38] (CR) Wolfgang Kandek: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[19:18:45] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: e998086f7cf7839d2c9aa917776509b3198c3142: searchSatisfaction: Allow for async initialisation (T274869) (duration: 01m 08s)
[19:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:05] thanks Urbanecm
[19:19:07] MatmaRex: your second config is on mwdebug1002
[19:19:09] np Jdlrobson
[19:19:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:20:36] Urbanecm: also looks good
[19:20:41] thanks, will sync
[19:20:48] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` kafka-logging1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-aut...
[19:20:52] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 4193ff71df421f2fe2ed3e1f2fa1c54334e722e2: Allow users to continue using reply tool after disabling A/B test (T276967) (duration: 01m 09s)
[19:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:01] T276967: Turn off Reply Tool A/B test - https://phabricator.wikimedia.org/T276967
[19:22:08] (PS1) Ryan Kemper: wdqs: revert wdqs.discovery.wmnet changes [puppet] - https://gerrit.wikimedia.org/r/670562 (https://phabricator.wikimedia.org/T266470)
[19:22:16] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/DiscussionTools/includes/Hooks/HookUtils.php: 9cb48f08f452a124868e1bf9d700a45c1d7255f4: Allow users to continue using reply tool after disabling A/B test (T276967) (duration: 01m 07s)
[19:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:55] (CR) Ryan Kemper: [C: +2] wdqs: revert wdqs.discovery.wmnet changes [puppet] - https://gerrit.wikimedia.org/r/670562 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper)
[19:23:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4824679d79d462459eba6b77a5af787817f186d2: Disable DiscussionTools Reply Tool A/B test (T276967) (duration: 01m 07s)
[19:23:55] !log T266470 Deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/670562 (copies over new pubkey)
[19:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:04] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[19:24:31] (PS3) Urbanecm: Enable DiscussionTools' beta features on frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/669895 (https://phabricator.wikimedia.org/T276189) (owner: Bartosz Dziewoński)
[19:24:36] (CR) Urbanecm: [C: +2] Enable DiscussionTools' beta features on frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/669895 (https://phabricator.wikimedia.org/T276189) (owner: Bartosz Dziewoński)
[19:25:18] (CR) Razzi: "I'll confirm before I merge this" [puppet] - https://gerrit.wikimedia.org/r/670559 (https://phabricator.wikimedia.org/T273278) (owner: Razzi)
[19:25:20] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-logging1002.eqiad.wmnet'] ` and were **ALL** successful.
[19:25:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5093618d5069dd287a4f33c1d49b5e5c8a05a13c: Enable DiscussionTools beta feature for newtopictool on most wikis (T275827) (duration: 01m 08s)
[19:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:36] T275827: Deploy config change to scale New Discussion Tool's availability as a beta feature - https://phabricator.wikimedia.org/T275827
[19:25:42] (PS5) Urbanecm: Enable Growth features on eowiki in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123)
[19:26:23] !log T266470 `sudo chown -Rv gitpuppet:gitpuppet /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/` && `sudo chown -v gitpuppet:gitpuppet /srv/private/modules/secret/secrets/ssl/wdqs.discovery.wmnet.key`
[19:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:39] (Merged) jenkins-bot: Enable DiscussionTools' beta features on frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/669895 (https://phabricator.wikimedia.org/T276189) (owner: Bartosz Dziewoński)
[19:27:18] MatmaRex: pulled onto mwdebug1002, can you test?
[19:27:18] !log T266470 `/srv/private` commit SHA for this change is `45852086679616bccb5bba3dd6396082b0f25a3d`
[19:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:37] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (RobH)
[19:27:54] Urbanecm: yeah, looks good to me
[19:28:05] thanks, syncing
[19:28:33] !log T266470 `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` && `sudo run-puppet-agent`
[19:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 84271f616081e28e48676a2dd498bd904d5c0b76: Enable DiscussionTools beta features on frwiktionary (T276189) (duration: 01m 09s)
[19:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:46] T276189: Enable the Reply tool as a Beta Feature for the French Wiktionary - https://phabricator.wikimedia.org/T276189
[19:29:50] and done
[19:29:52] anything else?
[19:29:54] thank you
[19:30:07] np
[19:30:11] (PS6) Urbanecm: Enable Growth features on eowiki in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123)
[19:30:18] (PS7) Urbanecm: Enable Growth features on eowiki in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123)
[19:30:23] (CR) Urbanecm: [C: +2] Enable Growth features on eowiki in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123) (owner: Urbanecm)
[19:32:03] (Merged) jenkins-bot: Enable Growth features on eowiki in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123) (owner: Urbanecm)
[19:32:17] !log T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"'` && `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo run-puppet-agent'`
[19:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:24] T266470: Expose wdqs1009 to wdqs users and gather feedback - https://phabricator.wikimedia.org/T266470
[19:33:46] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: REIMAGE
[19:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:45] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging1003.eqiad.wmnet with reason: REIMAGE
[19:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a130e9f2eab6dec12aec4380efdfd6bde1767aeb: Enable Growth features on eowiki in stealth mode (T276123) (duration: 01m 08s)
[19:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:43] T276123: Deploy Growth features on Esperanto Wikipedia - https://phabricator.wikimedia.org/T276123
[19:38:40] SRE: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (Majavah) Hi, thanks! Copying from the subtask: >>! In T277081#6902189, @Majavah wrote: > deployment-ircd02 is now working on Buster and from a very quick look it seems to be working properly.
[19:43:11] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-logging1003.eqiad.wmnet'] ` and were **ALL** successful.
[19:51:05] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (RobH) Open→Resolved
[19:52:22] (PS1) Majavah: Update comment for irc.beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/670566 (https://phabricator.wikimedia.org/T277081)
[19:53:00] that's a comment changing only beta config patch ^
[19:58:23] !log train status: 1.36.0-wmf.34 (T274938): currently blocked at group0 as client error logging is broken (UBN ticket incoming), will hold for patch.
[19:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:30] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938
[19:58:45] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/670567
[19:58:54] (PS1) Ahmon Dancy: logspam-watch.sh: Add .logspamwatchrc support [puppet] - https://gerrit.wikimedia.org/r/670568
[20:00:04] brennen and liw: #bothumor I ♥ Unicode. All rise for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T2000).
[20:03:19] SRE, ops-eqiad, Analytics, DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (crusnov) p:Triage→Medium
[20:03:40] SRE, User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (crusnov) p:Triage→Medium
[20:04:22] (CR) Ottomata: [C: +1] bump eventgate-logging-external & attach geoip [deployment-charts] - https://gerrit.wikimedia.org/r/670288 (https://phabricator.wikimedia.org/T263496) (owner: CDanis)
[20:05:10] brennen if the task you were going to report was about https://phabricator.wikimedia.org/T277094, patch merged and just needs backporting
[20:05:31] (PS1) DannyS712: Error in shouldLog logic drops most errors [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670529 (https://phabricator.wikimedia.org/T277094)
[20:05:36] (CR) Brennen Bearnes: [C: +1] "Nice and simple, I like it." [puppet] - https://gerrit.wikimedia.org/r/670568 (owner: Ahmon Dancy)
[20:06:49] (CR) CDanis: [C: +2] bump eventgate-logging-external & attach geoip [deployment-charts] - https://gerrit.wikimedia.org/r/670288 (https://phabricator.wikimedia.org/T263496) (owner: CDanis)
[20:06:53] (PS2) Herron: grafana: add domainrw param and lookup [puppet] - https://gerrit.wikimedia.org/r/670567
[20:07:31] DannyS712: ack, thx
[20:07:39] (Merged) jenkins-bot: bump eventgate-logging-external & attach geoip [deployment-charts] - https://gerrit.wikimedia.org/r/670288 (https://phabricator.wikimedia.org/T263496) (owner: CDanis)
[20:07:41] (CR) Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1002/28487/" [puppet] - https://gerrit.wikimedia.org/r/670567 (owner: Herron)
[20:11:06] Jdlrobson: shall i deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/670529 ?
[20:11:52] brennen: yeah...I was just going to ask you if you could or if you wanted me to so +1 from me :)
[20:12:13] yep, on it. :)
[20:12:19] <3
[20:12:54] (CR) Brennen Bearnes: [C: +2] Error in shouldLog logic drops most errors [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670529 (https://phabricator.wikimedia.org/T277094) (owner: DannyS712)
[20:13:00] (PS4) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[20:14:08] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[20:15:35] SRE, Domains, Okapi, Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (crusnov) Has this been followed up with an NDA ticket? @MNadrofsky
[20:16:43] Jdlrobson: anything we can / should do to test this one?
[20:17:00] (PS5) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[20:17:56] brennen: if we deploy it i can create an error on mw.org
[20:17:57] to confirm
[20:18:08] can do that on debug if needed
[20:18:17] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[20:18:20] Jdlrobson: cool, i will let you know when it's on an mwdebug box
[20:18:22] (Merged) jenkins-bot: Error in shouldLog logic drops most errors [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670529 (https://phabricator.wikimedia.org/T277094) (owner: DannyS712)
[20:21:03] Jdlrobson: on mwdebug1002
[20:22:35] brennen: testing
[20:27:28] brennen: hmm.. i'm testing on m.mediawiki.org but not seeing any difference on mwdebug1002.
[20:27:58] hrm
[20:29:12] i mean it's not any worse, i just can't understand what's happening here right now
[20:30:43] (PS1) Urbanecm: jawiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670575 (https://phabricator.wikimedia.org/T276830)
[20:30:56] (PS1) Cwhite: logstash: add dead letter queue support [puppet] - https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080)
[20:31:09] Jdlrobson: just making sure i didn't mess anything up, but:
[20:31:11] brennen@mwdebug1002:/srv/mediawiki/php-1.36.0-wmf.34/extensions/WikimediaEvents$ ls -lah modules/ext.wikimediaEvents/clientError.js
[20:31:14] -rw-r--r-- 1 mwdeploy mwdeploy 15K Mar 10 20:19 modules/ext.wikimediaEvents/clientError.js
[20:31:18] brennen: i think it's likely me.
[20:31:27] brennen: may i help anyway?
[20:31:32] *in any way
[20:31:36] i think there's still an error in the logic
[20:32:28] Urbanecm: thanks - i think we're good on the deploy end tho
[20:32:37] okay
[20:32:49] if you have useful thoughts about client error logging, you're doing better than me. :)
[20:33:20] sorry, Jdlrobson is probably better on that :)
[20:33:38] brennen: i think i need more time with this one :( i also need to eat lunch
[20:34:33] Jdlrobson: totally fair. i can file a placeholder blocker ticket if there's not one yet, and let me know if sending up the train blocked e-mail bat signal to get more eyes would be of any use.
[20:35:11] (ah, right, there's an existing ticket. adding to train task.)
[20:35:32] Jdlrobson: did you use ?debug=1 when testing?
[20:35:58] (PS2) Cwhite: logstash: add dead letter queue support [puppet] - https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080)
[20:36:26] !log cdanis@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[20:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:03] after lunch in about 30 mins i'll look at this. hopefully i'll be able to work out what's happening within an hour.
[20:37:22] cool
[20:37:39] there's no harm in deploying this one to mediawiki.org to rule out errors on my end or in debug
[20:37:49] if errors are working we may see them come into logstash
[20:38:00] we can always revert it later but i'll leave that to you
[20:38:01] bbiab
[20:38:04] Jdlrobson: ack, i will go ahead and sync.
[20:38:46] (PS3) Cwhite: logstash: add dead letter queue support [puppet] - https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080)
[20:41:52] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: [[gerrit:670529|Error in shouldLog logic drops most errors (T277094)]] (duration: 01m 14s)
[20:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:02] T277094: Bug in client error logging stops any errors from being logged in group 0 wikis - https://phabricator.wikimedia.org/T277094
[20:44:16] brennen: would you mind me sneaking in a quick config patch?
[20:44:47] Urbanecm: you should be clear, please ping me when finished.
[20:44:58] (PS1) Urbanecm: thwiki: Make Growth features available to newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/670577 (https://phabricator.wikimedia.org/T274646)
[20:44:59] thanks a lot!
[20:45:07] (CR) Urbanecm: [C: +2] thwiki: Make Growth features available to newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/670577 (https://phabricator.wikimedia.org/T274646) (owner: Urbanecm)
[20:47:07] (Merged) jenkins-bot: thwiki: Make Growth features available to newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/670577 (https://phabricator.wikimedia.org/T274646) (owner: Urbanecm)
[20:48:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 92ae985df5411de7ff983a778aebde0e10f6253e: thwiki: Make Growth features available to newcomers (T274646) (duration: 01m 08s)
[20:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:01] T274646: Growth tools deployment on Thai Wikipedia - https://phabricator.wikimedia.org/T274646
[20:49:12] (PS2) Urbanecm: jawiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670575 (https://phabricator.wikimedia.org/T276830)
[20:49:35] (CR) Urbanecm: [C: +2] jawiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670575 (https://phabricator.wikimedia.org/T276830) (owner: Urbanecm)
[20:50:41] (Merged) jenkins-bot: jawiki: Enable Growth features in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/670575 (https://phabricator.wikimedia.org/T276830) (owner: Urbanecm)
[20:50:55] !log cdanis@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 92ae985df5411de7ff983a778aebde0e10f6253e: thwiki: Make Growth features available to newcomers (T274646) (duration: 01m 07s)
[20:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:05] SRE, Domains, Okapi, Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (BBlack) @crusnov - It's been followed up offline from phab in general with some meetings, the output of which aren't (yet) reflected in phab, if you're just looking for whether it's being ignored...
[20:54:54] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 92ae985df5411de7ff983a778aebde0e10f6253e: thwiki: Make Growth features available to newcomers (T274646) (duration: 01m 08s)
[20:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:01] T274646: Growth tools deployment on Thai Wikipedia - https://phabricator.wikimedia.org/T274646
[20:55:09] brennen: I'm done, thanks a lot!
[20:56:00] Urbanecm: ack, thanks!
[20:56:26] !log Fixing wrong sync message: urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f72c3d6c4fcbda692c5bf8c37a38667c3ba12d80: jawiki: Enable Growth features in stealth mode (T276830) (duration: 01m 07s)
[20:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:44] T276830: Deploy Growth features on Japanese Wikipedia - https://phabricator.wikimedia.org/T276830
[20:56:47] !log Fixing wrong sync message: urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist f72c3d6c4fcbda692c5bf8c37a38667c3ba12d80: jawiki: Enable Growth features in stealth mode (T276830) (duration: 01m 08s)
[20:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:54] !log cdanis@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[20:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210310T2100). Please do the needful.
[21:00:47] !log cdanis@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[21:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:34] (PS2) Andrew Bogott: cinderutils::ensure: Add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511)
[21:01:36] (PS6) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[21:02:00] (CR) jerkins-bot: [V: -1] cinderutils::ensure: Add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[21:02:58] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[21:04:43] (PS3) Andrew Bogott: cinderutils::ensure: Add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511)
[21:04:45] (PS7) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[21:06:09] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[21:07:53] SRE, Domains, Okapi, Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (Reedy) Is this a duplicate, a subset or different to {T269686} ?
[21:08:15] SRE, Domains, Okapi, Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (crusnov) >>! In T276585#6902476, @BBlack wrote: > @crusnov - It's been followed up offline from phab in general with some meetings, the output of which aren't (yet) reflected in phab, if you're ju...
[21:12:56] (PS1) Urbanecm: jawiki: Growth features: Add help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/670583 (https://phabricator.wikimedia.org/T276830)
[21:13:21] brennen: would you mind me doing one more? 🙂
[21:13:32] !log cdanis@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[21:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:40] SRE, Domains, Okapi, Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (BBlack) p:Triage→High a:BBlack
[21:16:36] brennen: ok i'm gonna take another look now
[21:16:44] Urbanecm: go ahead.
[21:16:47] thanks
[21:16:55] (CR) Urbanecm: [C: +2] jawiki: Growth features: Add help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/670583 (https://phabricator.wikimedia.org/T276830) (owner: Urbanecm)
[21:17:26] Jdlrobson: cool. i went ahead and sent a blocker mail since we're trying to err on the side of more comms lately.
[21:17:39] (Merged) jenkins-bot: jawiki: Growth features: Add help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/670583 (https://phabricator.wikimedia.org/T276830) (owner: Urbanecm)
[21:19:58] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cdc47f3e35e815081f787def2d51f3fd337ecf6c: jawiki: Growth features: Add help panel links (T276830) (duration: 01m 08s)
[21:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:05] T276830: Deploy Growth features on Japanese Wikipedia - https://phabricator.wikimedia.org/T276830
[21:20:25] brennen: thanks again. I'm done now for real :)
[21:21:29] Urbanecm: cool, thx for update.
[21:23:13] (PS4) Dzahn: builder/docker: break out docker ferm rules into own profile [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869)
[21:25:49] (CR) Dzahn: builder/docker: break out docker ferm rules into own profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[21:26:00] brennen: ok, false alarm, i understand what's happening now
[21:26:15] the logstash board was set up incorrectly. the patch i just deployed is a bad one, so it should be reverted
[21:27:17] (PS1) Urbanecm: Revert "Error in shouldLog logic drops most errors" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670533 (https://phabricator.wikimedia.org/T277094)
[21:27:22] Jdlrobson: this one?
[21:27:33] that's the one.
[21:27:41] yepppp
[21:27:55] (CR) Brennen Bearnes: [C: +2] Revert "Error in shouldLog logic drops most errors" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670533 (https://phabricator.wikimedia.org/T277094) (owner: Urbanecm)
[21:28:08] (CR) Dzahn: "@JMeybohm Thanks for review, I did amend to follow your suggestion and renamed the file to just "docker-ferm" and adjusted the comment to " [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[21:29:11] (CR) Dduvall: [C: +1] builder/docker: break out docker ferm rules into own profile [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[21:30:08] !log train status: 1.36.0-wmf.34 (T274938): logstash client error board was set up incorrectly; reverting earlier patch for T277094 and will proceed to group1.
[21:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:16] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938
[21:30:17] T277094: Bug in client error logging stops any errors from being logged in group 0 wikis - https://phabricator.wikimedia.org/T277094
[21:30:20] Urbanecm: so this was the source of my confusion: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/670585 Function name shouldn't be confusing. [NEW]
[21:30:38] yup, confusing functions are always bad :)
[21:30:40] The function shouldLogFileUrl was actually shouldNotLogFileUrl
[21:30:57] ideally that patch would go out too, but not a big deal if not
[21:31:22] SRE, Domains, Okapi, Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (crusnov) @BBlack Thanks!
[21:31:29] Jdlrobson: not my call :)
[21:31:38] SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Ladsgroup) Thanks. I will give it a try. If I get stuck, I'll ask.
[21:32:36] sorry about all this.. i will focus on fixing the logstash board now
[21:35:00] (Merged) jenkins-bot: Revert "Error in shouldLog logic drops most errors" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670533 (https://phabricator.wikimedia.org/T277094) (owner: Urbanecm)
[21:35:16] SRE, Product-Data-Infrastructure, Epic, Goal, Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (CDanis)
[21:36:10] SRE, Analytics, Patch-For-Review: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (CDanis) Open→Resolved ASN, ISP/organization, country, & subdivision are now visible in Logstash!
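The confusion above came from a predicate whose name says the opposite of what it returns (shouldLogFileUrl actually meaning shouldNotLogFileUrl). The TypeScript below is a hypothetical sketch of that failure mode, not the actual clientError.js code: the ignore-list prefixes, the example URLs and the helpers shouldLogError, shouldLogErrorAfterNaiveFix and shouldLogErrorRenamed are all invented for illustration. It shows how a correct caller can look like a bug, and how a literal-minded "fix" then flips which errors get dropped.

```typescript
// Hypothetical sketch; NOT the real WikimediaEvents clientError.js code.
// The prefixes below are invented examples of sources one might ignore.
const IGNORED_URL_PREFIXES: readonly string[] = [
  "chrome-extension://",
  "moz-extension://",
];

// Misleading name: returns true when the file URL should be IGNORED.
function shouldLogFileUrl(fileUrl: string): boolean {
  return IGNORED_URL_PREFIXES.some((prefix) => fileUrl.indexOf(prefix) === 0);
}

// Identical logic under a name that matches its behaviour.
const shouldNotLogFileUrl = shouldLogFileUrl;

// Correct caller that reads like a bug: "if we should log it, don't log it".
function shouldLogError(fileUrl: string): boolean {
  if (shouldLogFileUrl(fileUrl)) {
    return false;
  }
  return true;
}

// A literal-minded "fix" drops the negation: errors from the site itself are
// now discarded and only the ignored (extension) sources get through.
function shouldLogErrorAfterNaiveFix(fileUrl: string): boolean {
  return shouldLogFileUrl(fileUrl);
}

// After a rename, the negation is visible at the call site.
function shouldLogErrorRenamed(fileUrl: string): boolean {
  return !shouldNotLogFileUrl(fileUrl);
}

const siteUrl = "https://www.mediawiki.org/w/load.php";
const extensionUrl = "chrome-extension://abcdef/content.js";
console.log(shouldLogError(siteUrl), shouldLogError(extensionUrl)); // true false
console.log(shouldLogErrorAfterNaiveFix(siteUrl)); // false
console.log(shouldLogErrorRenamed(siteUrl), shouldLogErrorRenamed(extensionUrl)); // true false
```

Renaming the predicate, which is what the follow-up patch title ("Function name shouldn't be confusing") suggests, moves the negation to where a reviewer reads it instead of hiding it inside the helper.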
[21:37:55] hmm https://test2.wikipedia.org/wiki/Special:Version is not the same as mediawiki.org version? is test2 different?
[21:38:12] i'm still not seeing any errors logged to mw.org so am a little suspicious that something may be broken
[21:38:34] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/28488/deneb.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[21:40:25] Jdlrobson: test2 is wmf.33, not wmf.34
[21:40:51] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: [[gerrit:670533|Revert "Error in shouldLog logic drops most errors" (T277094)]] (duration: 01m 08s)
[21:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:58] T277094: Bug in client error logging stops any errors from being logged in group 0 wikis - https://phabricator.wikimedia.org/T277094
[21:41:12] (PS4) Andrew Bogott: cinderutils::ensure: Add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511)
[21:41:14] (PS8) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[21:41:19] (Jdlrobson: and that's expected, test2wiki is supposed to be a testing wiki that's a bit behind testwiki)
[21:41:35] k yeah so no errors coming in to group 0 wikis
[21:41:38] (CR) jerkins-bot: [V: -1] cinderutils::ensure: Add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[21:42:10] Jdlrobson: normally i would say "that's great, no errors", but i assume it's not a cause for celebration :)
[21:42:11] Jdlrobson: ok, so the revert is synced there, and will continue to hold until we're sure what's going on.
[21:42:32] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[21:42:38] (PS5) Andrew Bogott: cinderutils::ensure: Add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511)
[21:42:40] (PS9) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[21:43:22] !log train status: 1.36.0-wmf.34 (T274938): client errors may still be missing for group0; continuing to hold for T277094 until we know what's broken.
[21:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:33] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938
[21:43:52] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[21:44:41] (CR) Dzahn: "noop on deneb" [puppet] - https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[21:44:50] (PS2) Dzahn: releases: include profile::docker::ferm in releases role [puppet] - https://gerrit.wikimedia.org/r/670289 (https://phabricator.wikimedia.org/T276869)
[21:45:12] Jdlrobson: is there anything i can do to help?
[21:47:35] Urbanecm: not really... i think https://logstash.wikimedia.org/app/dashboards#/view/AXN5OoJu3_NNwgAUlbUT?_g=h@c823129&_a=h@ab7de48 might be the problem here
[21:49:28] OK got it
[21:49:31] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/28490/releases2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/670289 (https://phabricator.wikimedia.org/T276869) (owner: Dzahn)
[21:49:33] it seems to say `'' should have required property 'message'` for me. Maybe that's a required field for https://intake-analytics.wikimedia.org/v1/events?
[21:52:04] Urbanecm: could you backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/670585 ?
[21:52:10] i can verify that fixes the issue if so
[21:52:27] probably best for Krinkle to take a look at that patch if he's available before merging to master and rolling train forward
[21:52:37] typescript would have likely helped here
[21:52:52] :)
[21:53:29] !log ferm/iptables docker NAT rules applied by puppet on releases servers after breaking out rules into their own profile class (T276869)
[21:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:36] T276869: Missing docker iptables nat rules for releases hosts - https://phabricator.wikimedia.org/T276869
[21:54:06] Jdlrobson: i can, but I'll leave it to brennen (unless you really prefer me ;))
[21:54:21] yeah i'll go ahead, one sec
[21:54:41] (PS1) RobH: an-druid1004 mac update [puppet] - https://gerrit.wikimedia.org/r/670591 (https://phabricator.wikimedia.org/T274163)
[21:54:41] ah, i see test build failed?
[21:55:01] (CR) RobH: [C: +2] an-druid1004 mac update [puppet] - https://gerrit.wikimedia.org/r/670591 (https://phabricator.wikimedia.org/T274163) (owner: RobH)
[21:55:04] So ErrorDescriptor has a errorMessage key not a message key
[21:58:40] (CR) Dzahn: [C: +2] add gitlab.wikimedia.org service alias, point to gitlab1001 [dns] - https://gerrit.wikimedia.org/r/670330 (https://phabricator.wikimedia.org/T276170) (owner: Dzahn)
[21:58:44] (PS4) Dzahn: add gitlab.wikimedia.org service alias, point to gitlab1001 [dns] - https://gerrit.wikimedia.org/r/670330 (https://phabricator.wikimedia.org/T276170)
[21:59:00] (PS1) Brennen Bearnes: Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094)
[22:01:19] SRE, DNS, Traffic, serviceops, and 4 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn) a:Dzahn
[22:01:58] (CR) Dzahn: "[authdns1001:~] $ host gitlab.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/670330 (https://phabricator.wikimedia.org/T276170) (owner: Dzahn)
[22:03:08] SRE, DNS, Traffic, serviceops, and 4 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn) Open→Resolved done! ` [authdns1001:~] $ host gitlab.wikimedia.org gitlab.wikimedia.org is an alias for gitlab1001.wikimedia.org. gitlab1001.wikimedia.org has address 208.80.154...
[22:03:37] SRE, DNS, Traffic, serviceops, and 4 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn) @Sergey.Trofimovsky.SF See above, the gitlab.wikimedia.org name now points to the VM. Keep in mind it's both IPv4 and IPv6.
[22:07:25] (CR) jerkins-bot: [V: -1] Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094) (owner: Brennen Bearnes)
[22:09:14] (PS2) Jdlrobson: Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094) (owner: Brennen Bearnes)
[22:09:50] (PS3) Jbond: P:gitlab: Deploy acme chief certificate [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673)
[22:09:58] (CR) Jbond: "I noticed i hadn't actually uploaded my last PS should be there now, with this i think we would get to make use of the shared acme-cheif i" [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[22:11:11] (CR) Jdlrobson: [C: +1] Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094) (owner: Brennen Bearnes)
[22:11:16] (CR) Jdlrobson: [C: +1] Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094) (owner: Brennen Bearnes)
[22:11:24] brennen: ok jenkins is happy with those now
[22:11:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:11:57] Jdlrobson: ack, backporting and will let you know when it's testable.
[22:12:05] brennen: sounds good
[22:12:10] (CR) Brennen Bearnes: [C: +2] Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094) (owner: Brennen Bearnes)
[22:12:14] not sure who's around to take care of that patch on master
[22:13:01] Jdlrobson: if this works on group0, how comfortable are you with the train going to group1?
[22:13:09] im comfortable if it works on group0
[22:13:18] ok, we'll plan on that then.
[22:13:37] i just want to make sure if there are any new JS errors in group1 those are flagged
[22:14:05] (CR) Dzahn: "I think using acme-chief to get and renew the cert is a good thing but I am not so sure about the service part that let's puppet control t" [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[22:16:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:18:54] brennen: lemme know when i can test on debug
[22:19:20] (Merged) jenkins-bot: Fix client error logging [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670535 (https://phabricator.wikimedia.org/T277094) (owner: Brennen Bearnes)
[22:20:15] (PS4) Jbond: P:gitlab: Deploy acme chief certificate [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673)
[22:21:28] (CR) Jbond: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[22:22:06] Jdlrobson: on mwdebug1002
[22:22:11] on it
[22:22:38] brennen: that does it
[22:23:01] Jdlrobson: cool. syncing this and then rolling to group1.
[22:23:06] sweet
[22:25:16] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: [[gerrit:670535|Fix client error logging (T277094)]] (duration: 01m 09s)
[22:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:24] T277094: Bug in client error logging stops any errors from being logged in group 0 wikis - https://phabricator.wikimedia.org/T277094
[22:26:37] !log train status: 1.36.0-wmf.34 (T274938): T277094 believed resolved, promoting to group1.
[22:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:45] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938
[22:27:21] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry1001.eqiad.wmnet
[22:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:45] (PS1) Brennen Bearnes: group1 wikis to 1.36.0-wmf.34 [mediawiki-config] - https://gerrit.wikimedia.org/r/670603
[22:27:47] (CR) Brennen Bearnes: [C: +2] group1 wikis to 1.36.0-wmf.34 [mediawiki-config] - https://gerrit.wikimedia.org/r/670603 (owner: Brennen Bearnes)
[22:28:31] SRE, Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (Ladsgroup) There is the archive aspect of upgrade, there's also the double support aspect of the upgrade that bothers me a lot and couldn't come up with a good solution for yet. Imagine...
[22:28:52] (Merged) jenkins-bot: group1 wikis to 1.36.0-wmf.34 [mediawiki-config] - https://gerrit.wikimedia.org/r/670603 (owner: Brennen Bearnes)
[22:30:36] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.34
[22:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:07] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.36.0-wmf.34 (duration: 01m 30s)
[22:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:19] (PS1) Legoktm: Remove registry1001.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/670604 (https://phabricator.wikimedia.org/T272550)
[22:33:40] so far so good. (saying that is usually the trigger for _something_ to explode.)
[22:34:15] 💣
[22:37:34] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) >>! In T276144#6899820, @jbond wrote: > AFAIK Gerrit uses an external IP address exp...
[22:40:18] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry1001.eqiad.wmnet
[22:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:25] SRE, serviceops, Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `registry1001.eqiad.wmnet` - registry1001.eqiad.wmnet (**PASS**) - Downtimed...
[22:40:55] (CR) Legoktm: [C: +2] Remove registry1001.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/670604 (https://phabricator.wikimedia.org/T272550) (owner: Legoktm)
[22:42:29] (PS1) Legoktm: Remove registry200[12].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/670605 (https://phabricator.wikimedia.org/T272550)
[22:42:47] (CR) Ahmon Dancy: [C: -1] pipeline: Initial multiversion pipeline configuration (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: Dduvall)
[22:43:20] (PS6) Andrew Bogott: cinderutils::ensure: Refactor and add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511)
[22:43:22] (PS10) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[22:43:48] (CR) jerkins-bot: [V: -1] cinderutils::ensure: Refactor and add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[22:43:59] SRE, SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (JKatzWMF) @Gilles done. @PCoombe, you should be able to add others on your team to that domain now as well. Donate.wikimedia setup, but will take some t...
[22:44:41] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[22:44:55] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry[2001-2002].codfw.wmnet
[22:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:06] (PS7) Andrew Bogott: cinderutils::ensure: Refactor and add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511)
[22:46:08] (PS11) Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078)
[22:47:43] (CR) jerkins-bot: [V: -1] profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[22:48:24] (CR) Dzahn: [C: -1] "This will be waiting until we get asked to open the port. https://phabricator.wikimedia.org/T276144#6890920" [puppet] - https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: Dzahn)
[22:49:20] (CR) Andrew Bogott: [C: +2] cinderutils::ensure: Refactor and add a 'mount' resource [puppet] - https://gerrit.wikimedia.org/r/670553 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[22:49:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=docker-registry site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:51:05] !log updating puppet compiler facts to catch up with a new custom fact
[22:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:11] (PS5) Dzahn: P:gitlab: Deploy acme chief certificate [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[22:51:54] (CR) Dzahn: [C: +1] "I like this now with the simple exec. Made a small edit to spelling in comments. lgtm" [puppet] - https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: Jbond)
[22:53:01] (CR) Cwhite: [C: +1] grafana: add domainrw param and lookup [puppet] - https://gerrit.wikimedia.org/r/670567 (owner: Herron)
[22:53:47] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-druid1004.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/...
[22:55:27] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry[2001-2002].codfw.wmnet
[22:55:33] SRE, serviceops, Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `registry[2001-2002].codfw.wmnet` - registry2001.codfw.wmnet (**PASS**) - Do...
[22:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:35] (PS1) Mholloway: [MEP] Stream always in sample if the user is in debugMode [extensions/EventLogging] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670536
[22:58:14] (CR) Legoktm: [C: +2] Remove registry200[12].codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/670605 (https://phabricator.wikimedia.org/T272550) (owner: Legoktm)
[22:58:33] (Abandoned) Jason Linehan: Enable session tick on testwiki with 100% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/670496 (https://phabricator.wikimedia.org/T276515) (owner: Jason Linehan)
[23:01:23] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts registry1002.eqiad.wmnet
[23:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:43] (PS1) Legoktm: Remove registry1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/670607 (https://phabricator.wikimedia.org/T272550)
[23:03:45] (PS1) Legoktm: site.pp: Tighten registry* regex [puppet] - https://gerrit.wikimedia.org/r/670608 (https://phabricator.wikimedia.org/T272550)
[23:04:24] (PS2) Legoktm: Remove registry1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/670607 (https://phabricator.wikimedia.org/T272550)
[23:04:26] (PS2) Legoktm: site.pp: Tighten registry* regex [puppet] - https://gerrit.wikimedia.org/r/670608 (https://phabricator.wikimedia.org/T272550)
[23:04:59] PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - /srv/docker/containers/bdb11654e8fc3ba335d6bde79eb44ac39df37f528ccd3f43a8e494e70503b618/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops
[23:05:38] didn't I _just_ fix that exact issue... man
[23:06:23] what it needs to do is ignore /srv/docker but I did that and it was resolved
[23:06:25] (CR) Cwhite: "https://puppet-compiler.wmflabs.org/compiler1003/28492/" [puppet] - https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) (owner: Cwhite)
[23:07:40] but on the other hand, it's not like anyone using the releaes servers is getting an alert from this
[23:08:38] ah.. /run/docker vs /srv/docker is why it's still an issue now
[23:10:00] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts registry1002.eqiad.wmnet
[23:10:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: REIMAGE
[23:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:07] SRE, serviceops, Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `registry1002.eqiad.wmnet` - registry1002.eqiad.wmnet (**PASS**) - Downtimed...
[23:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:14] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: REIMAGE
[23:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:29] RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops
[23:15:05] nevermind, it was just a temp thing apparently, fine now
[23:17:51] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS38930/IPv6: Idle - Fiberring, AS38930/IPv4: Idle - Fiberring https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:18:18] (PS15) Dduvall: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182)
[23:19:25] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1004.eqiad.wmnet'] ` and were **ALL** successful.
[23:20:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:21:07] (CR) Mholloway: [C: +2] [MEP] Stream always in sample if the user is in debugMode [extensions/EventLogging] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670536 (owner: Mholloway)
[23:21:33] (PS16) Dduvall: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182)
[23:21:37] (CR) Dduvall: pipeline: Initial multiversion pipeline configuration (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: Dduvall)
[23:25:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:26:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:38:53] (PS17) Dduvall: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182)
[23:39:05] (CR) Legoktm: [C: +2] Remove registry1002.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/670607 (https://phabricator.wikimedia.org/T272550) (owner: Legoktm)
[23:39:26] (CR) Legoktm: [C: +2] site.pp: Tighten registry* regex [puppet] - https://gerrit.wikimedia.org/r/670608 (https://phabricator.wikimedia.org/T272550) (owner: Legoktm)
[23:42:14] (PS18) Dduvall: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182)
[23:45:31] (Merged) jenkins-bot: [MEP] Stream always in sample if the user is in debugMode [extensions/EventLogging] (wmf/1.36.0-wmf.34) - https://gerrit.wikimedia.org/r/670536 (owner: Mholloway)
[23:49:24] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Legoktm)
[23:49:50] SRE, serviceops, Patch-For-Review: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (Legoktm) Open→Resolved Everything is Buster now, Stretch is gone \o/
[23:49:56] !log mholloway-shell@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/EventLogging: EventLogging: Stream always in sample if the user is in debugMode (T276515) (duration: 01m 23s)
[23:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:04] T276515: Generate Session Length test data - https://phabricator.wikimedia.org/T276515