[00:00:51] (03PS5) 10Legoktm: lists: Notify when exclude_backups list is out of sync [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) [00:03:20] (03PS1) 10Mstyles: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) [00:04:04] !log [WDQS] pooled `wdqs1004` [00:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:31] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01001 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:05:53] (03CR) 10Legoktm: [V: 03+1 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29136/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [00:06:59] !log `sudo -i wmf-auto-reimage-host -p T280382 wdqs1006.eqiad.wmnet` [00:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:08] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [00:08:27] (03PS1) 10Razzi: sqoop: switch to single grouped_wikis.csv [puppet] - 10https://gerrit.wikimedia.org/r/681498 (https://phabricator.wikimedia.org/T280549) [00:09:07] (03PS1) 10Papaul: Add cloudnet2004-dev MAC address, partman recipe and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/681499 (https://phabricator.wikimedia.org/T275676) [00:09:10] (03PS1) 10Dzahn: ci/deployment-server: create kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 [00:13:00] (03PS1) 10Legoktm: mailman: Fix check_exclude_backups exit code [puppet] - 10https://gerrit.wikimedia.org/r/681501 [00:14:10] !log [WDQS] Pooled `wdqs2008` [00:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:35] (03CR) 10Papaul: [C: 03+2] Add 
cloudnet2004-dev MAC address, partman recipe and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/681499 (https://phabricator.wikimedia.org/T275676) (owner: 10Papaul) [00:15:25] (03PS2) 10Legoktm: mailman: Fix check_exclude_backups exit code [puppet] - 10https://gerrit.wikimedia.org/r/681501 [00:15:55] !log [WDQS] Pooled `wdqs1003` [00:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:13] (03CR) 10Razzi: "Hi @Joal, how does this look?" [puppet] - 10https://gerrit.wikimedia.org/r/681498 (https://phabricator.wikimedia.org/T280549) (owner: 10Razzi) [00:22:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudnet2004-dev.c... [00:25:27] (03CR) 10Legoktm: [C: 03+2] mailman: Fix check_exclude_backups exit code [puppet] - 10https://gerrit.wikimedia.org/r/681501 (owner: 10Legoktm) [00:32:26] smtplib.SMTPRecipientsRefused: {'root@lists1001.wikimedia.org': (451, b'Temporary local problem - please try later')} [00:32:28] wonderful [00:34:15] is /etc/aliases supposed to be only readable to root? 
[00:36:56] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2004-dev.codfw.wmnet with reason: REIMAGE [00:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:52] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudnet2004-dev.codfw.wmnet with reason: REIMAGE [00:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:25] 10SRE, 10Mail: Mail to root@lists1001.wikimedia.org doesn't work because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 (10Legoktm) p:05Triage→03High [00:40:36] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0053 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:48:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2004-dev.codfw.wmnet'] ` and were **ALL** successful. 
[00:52:17] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:17] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10Papaul) [00:58:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10Papaul) @Andrew @aborrero it said that the second interface needs to be in cloud-virt-instance-trunk vlan, I have no such... [01:03:00] (03PS9) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [01:03:38] (03CR) 10Ryan Kemper: "Two comments left to get to but almost done with the feedback" (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [01:04:11] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10HMonroy) Thank you everyone for making this possible! 
[01:06:05] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [01:13:20] ACKNOWLEDGEMENT - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service,curator_actions_apifeatureusage_eqiad.service cole_white Known - T274394 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:03] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:33:17] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.366 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:40:17] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:28:39] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.105 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:36:11] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:33:20] (03PS1) 10Marostegui: mariadb: Decommission db1086 [puppet] - 10https://gerrit.wikimedia.org/r/681515 (https://phabricator.wikimedia.org/T278229) [05:33:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1086.eqiad.wmnet [05:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:44] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove s3 from db2098 [puppet] - 10https://gerrit.wikimedia.org/r/681448 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [05:37:14] (03CR) 10Marostegui: "There are some alerts 
fired on the host, if they are normal for this host +1. But would you mind double checking? I am not sure if they ar" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [05:39:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1086 [puppet] - 10https://gerrit.wikimedia.org/r/681515 (https://phabricator.wikimedia.org/T278229) (owner: 10Marostegui) [05:41:37] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (10Marostegui) This is ready for #dc-ops [05:41:53] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [05:41:53] 10ops-eqiad, 10decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (10Marostegui) [05:42:02] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (10Marostegui) [05:42:26] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [05:42:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1086.eqiad.wmnet [05:42:39] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1086.eqiad.wmnet` - db1086.eqiad.wmnet (**PASS**) - Downtimed host on Icinga -... 
[05:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:45] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [06:03:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2077.codfw.wmnet with reason: REIMAGE [06:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2082.codfw.wmnet with reason: REIMAGE [06:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:22] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) Sounds good to me. [06:05:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2077.codfw.wmnet with reason: REIMAGE [06:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2082.codfw.wmnet with reason: REIMAGE [06:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 to clone db1156 T258361', diff saved to https://phabricator.wikimedia.org/P15491 and previous config saved to /var/cache/conftool/dbconfig/20210421-061019-marostegui.json [06:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:29] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:11:08] !log Stop MySQL on db1074 to clone db1156 (there will be lag in s2 in 
wiki replicas) T258361 [06:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:48] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Transfer started from db1074 to db1156 [06:13:50] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) [06:15:17] PROBLEM - MariaDB Replica IO: s2 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1074.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1074.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:15:45] ^ me [06:17:04] (03PS1) 10Marostegui: db1124,db1125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681518 (https://phabricator.wikimedia.org/T258361) [06:18:37] (03CR) 10Marostegui: [C: 03+2] db1124,db1125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681518 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [06:25:04] PROBLEM - MariaDB Replica Lag: s2 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 820.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:26:50] ^ me [06:33:55] (03CR) 10Elukey: [C: 03+2] install_server: add custom recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681396 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [06:34:02] (03PS3) 10Elukey: install_server: add custom recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681396 (https://phabricator.wikimedia.org/T278424) [06:38:32] (03PS1) 10Elukey: Add two upstream patches on top of the 4.9 release [debs/hue] - 
10https://gerrit.wikimedia.org/r/681520 (https://phabricator.wikimedia.org/T264896) [06:39:55] (03PS2) 10Elukey: Add two upstream patches on top of the 4.9 release [debs/hue] - 10https://gerrit.wikimedia.org/r/681520 (https://phabricator.wikimedia.org/T264896) [06:40:37] (03CR) 10Elukey: [V: 03+2 C: 03+2] "Built and tested on an-test-ui1001, all good!" [debs/hue] - 10https://gerrit.wikimedia.org/r/681520 (https://phabricator.wikimedia.org/T264896) (owner: 10Elukey) [06:40:40] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) Hi! Can it be done? We are planning to deploy early next week. [06:42:00] !log upload hue_4.9.0-2+deb10u1 to buster-wikimedia [06:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Gonna merge this and see how it affects already running services in staging-codfw. It should not somehow affect them, but we 'll see if th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/681473 (https://phabricator.wikimedia.org/T238909) (owner: 10Alexandros Kosiaris) [06:43:07] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) Hey, I thought this was meant to be done in 2-3 weeks :) (T278614#6992922) Also, I thought we wanted to delete the temporary databases before proceeding. Keep in mind... [06:43:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Gonna merge this and see how it affects already running services in staging-codfw. 
It should not somehow affect them, but we 'll see if th" [puppet] - 10https://gerrit.wikimedia.org/r/681470 (https://phabricator.wikimedia.org/T238909) (owner: 10Alexandros Kosiaris) [06:44:04] (03Merged) 10jenkins-bot: staging-codfw: Advertise service cluster IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/681473 (https://phabricator.wikimedia.org/T238909) (owner: 10Alexandros Kosiaris) [06:49:16] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [06:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:28] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [06:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:41] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) >>! In T278614#7022974, @Marostegui wrote: > Hey, > > I thought this was meant to be done in 2-3 weeks :) (T278614#6992922) Well, it was one week ago :D. But it's sti... [06:53:35] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) >>! In T278614#7022981, @Ladsgroup wrote: > > hmm, that's a tough one. So we are enabling mailman3 on lists1001.wikimedia.org (lists.wikimedia.org) but we are not upgr... 
[06:53:43] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) 05Stalled→03Open [06:53:45] 10SRE, 10Wikimedia-Mailing-lists: Install mailman3 on lists1001.wikimedia.org - https://phabricator.wikimedia.org/T278610 (10Marostegui) [06:55:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29137/console" [puppet] - 10https://gerrit.wikimedia.org/r/681358 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [06:55:38] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:55:54] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::analytics_cluster::coordinator: move mysql to /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/681358 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [06:58:06] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) Sure. I will make sure to remind you to delete it. Yes. `mailman3` with user `mailman3` and `mailman3web` with user `mailman3web`. They should bound to a different host... 
[07:00:12] (03CR) 10Muehlenhoff: [C: 03+2] profile::conftool::client: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668019 (owner: 10Muehlenhoff) [07:00:36] (03PS1) 10Marostegui: mariadb: Productionize db1156 on s2 [puppet] - 10https://gerrit.wikimedia.org/r/681574 (https://phabricator.wikimedia.org/T258361) [07:02:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1156 on s2 [puppet] - 10https://gerrit.wikimedia.org/r/681574 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [07:02:54] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) oh and this one needs backups but that can happen later. [07:03:24] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10Ladsgroup) I remember writing something for ores that sends all log entries to logstash (it was also uwsgi so it should be really similar) if you want to look it up but I'm... 
[07:04:44] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:06:41] (03CR) 10Ladsgroup: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/681244 (owner: 10Legoktm) [07:08:21] (03PS1) 10Marostegui: production-m5.sql.erb: Add new mailman3 final users [puppet] - 10https://gerrit.wikimedia.org/r/681575 (https://phabricator.wikimedia.org/T278614) [07:11:57] (03PS1) 10Elukey: install_server: fix netboot for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681576 [07:12:58] (03CR) 10Elukey: [C: 03+2] install_server: fix netboot for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681576 (owner: 10Elukey) [07:14:50] (03PS1) 10Elukey: install_server: fix netboot for an-coord1001 - extra space [puppet] - 10https://gerrit.wikimedia.org/r/681577 [07:15:11] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) Databases created on m5 (same IP where the test databases are) Ferm needs updating to be able to reach m5-master: ` root@lists1001:~# telnet db11... [07:15:43] (03CR) 10Elukey: [C: 03+2] install_server: fix netboot for an-coord1001 - extra space [puppet] - 10https://gerrit.wikimedia.org/r/681577 (owner: 10Elukey) [07:16:29] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Add new mailman3 final users [puppet] - 10https://gerrit.wikimedia.org/r/681575 (https://phabricator.wikimedia.org/T278614) (owner: 10Marostegui) [07:17:26] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) >>! In T278614#7023003, @Ladsgroup wrote: > oh and this one needs backups but that can happen later. @jcrespo can you handle this? Thank you! 
[07:23:25] (03PS1) 10Elukey: install_server: fix partman recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681578 [07:24:56] (03CR) 10Elukey: [C: 03+2] install_server: fix partman recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681578 (owner: 10Elukey) [07:27:27] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [07:30:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [07:33:46] (03PS1) 10Elukey: install_server: fix custom recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681580 [07:34:40] (03CR) 10Elukey: [C: 03+2] install_server: fix custom recipe for an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/681580 (owner: 10Elukey) [07:37:49] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10jcrespo) >>! In T278614#7023031, @Marostegui wrote: > @jcrespo can you handle this? Thank you! Which section is this in? I cannot find it on the task. 
[07:38:24] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10jcrespo) [07:38:53] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2001.codfw.wmnet [07:38:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2002.codfw.wmnet [07:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:58] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) This lives on m5: T278614#7023029 [07:40:16] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) [07:42:23] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2001.codfw.wmnet [07:44:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2002.codfw.wmnet [07:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:39] (03CR) 10Elukey: "Razzi: this is a self review after running the debian installer, several things were missing and I 
had to fix them manually. Should have s" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681396 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [07:46:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2003.codfw.wmnet [07:46:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet [07:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:05] (03CR) 10Jcrespo: "The s3 being lagged and stopped is normal. There was something weird, which is wmf_auto_restart_prometheus-mysqld-exporter@s5.timer trying" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:50:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-coord1001.eqiad.wmnet with reason: REIMAGE [07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:09] (03CR) 10Marostegui: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:51:21] (03CR) 10Marostegui: [C: 03+1] "+1 cause the patchset is correct itself" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:52:16] 10SRE, 10Icinga, 10observability: implement paging for non-ops teams - https://phabricator.wikimedia.org/T141038 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi We have implemented paging for non-ops teams in VO/splunk oncall, within icinga and alertmanager has that capability as well. I'm boldly resol... 
[07:52:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet [07:52:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2003.codfw.wmnet [07:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-coord1001.eqiad.wmnet with reason: REIMAGE [07:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2005.codfw.wmnet [07:52:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2006.codfw.wmnet [07:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:01] (03CR) 10Jcrespo: "Trying Moritz for the wmf_auto_restart_prometheus-mysqld-exporter@*.timer services, almost at random, as I remember him working on other a" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:57:19] (03CR) 10Jcrespo: "Blame on modules/base/manifests/service_auto_restart.pp tells me that either him or JBond will know more about it, hopefully." 
[puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:58:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2005.codfw.wmnet [07:58:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2006.codfw.wmnet [07:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:35] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10fgiunchedi) For daemons that are logging to syslog/journald the tl;dr to get the logs in logstash is to add the "program name" to `modules/profile/files/rsyslog/lookup_tabl... [07:58:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2007.codfw.wmnet [07:59:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2008.codfw.wmnet [07:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:11] (03CR) 10Joal: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/681498 (https://phabricator.wikimedia.org/T280549) (owner: 10Razzi) [08:05:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2007.codfw.wmnet [08:05:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2008.codfw.wmnet [08:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores2009.codfw.wmnet [08:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:54] !log 
jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2009.codfw.wmnet [08:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet [08:16:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1002.eqiad.wmnet [08:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet [08:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:01] (03PS1) 10Jcrespo: mariadb: Add mailman3 and mailman3web to the list of host to be backed up [puppet] - 10https://gerrit.wikimedia.org/r/681584 (https://phabricator.wikimedia.org/T278614) [08:26:16] (03PS2) 10Jcrespo: mariadb: Add mailman3 and mailman3web to the list of host to be backed up [puppet] - 10https://gerrit.wikimedia.org/r/681584 (https://phabricator.wikimedia.org/T278614) [08:29:23] 10SRE, 10netops: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10ayounsi) 05Open→03Resolved a:03ayounsi I explored a bit of the data in Turnilo, as we can now filter on community 14907:12. It's not easy to estimate the gain, it's not null, but most likely quite mi... [08:29:34] 10SRE, 10Icinga, 10observability: implement paging for non-ops teams - https://phabricator.wikimedia.org/T141038 (10Peachey88) [08:30:27] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) Replication started on db1156 and checking tables. 
[08:30:33] (03CR) 10Elukey: alerts: add victorops paging for hadoop master and kafka broker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681420 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [08:30:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1002.eqiad.wmnet [08:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:09] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10jcrespo) >>! In T278614#7023003, @Ladsgroup wrote: > oh and this one needs backups but that can happen later. @Ladsgroup please... [08:33:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1003.eqiad.wmnet [08:33:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [08:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1003.eqiad.wmnet [08:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1004.eqiad.wmnet [08:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1005.eqiad.wmnet [08:41:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1006.eqiad.wmnet [08:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:33] (03CR) 10Ayounsi: [C: 03+1] "LGTM! 
Those prefixes are marked as reserved in Netbox, don't forget to make them active once they are advertised." [homer/public] - 10https://gerrit.wikimedia.org/r/681472 (https://phabricator.wikimedia.org/T238909) (owner: 10Alexandros Kosiaris) [08:45:28] RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:45:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15493 and previous config saved to /var/cache/conftool/dbconfig/20210421-084555-root.json [08:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1006.eqiad.wmnet [08:46:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1005.eqiad.wmnet [08:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:23] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1007.eqiad.wmnet [08:47:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1008.eqiad.wmnet [08:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:31] !log filippo@deploy1002 Started deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 [08:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:41] !log filippo@deploy1002 Finished deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 (duration: 00m 10s) [08:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:07] 10SRE, 
10vm-requests: Site: 2 VMs for failoid - https://phabricator.wikimedia.org/T280759 (10MoritzMuehlenhoff) [08:52:24] 10SRE, 10vm-requests: Site: 2 VMs for failoid - https://phabricator.wikimedia.org/T280759 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [08:52:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1007.eqiad.wmnet [08:52:52] !log filippo@deploy1002 Started deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 [08:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:57] !log filippo@deploy1002 Finished deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 (duration: 00m 05s) [08:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1008.eqiad.wmnet [08:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:31] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10jbond) @Aklapper I think SRE are only tagged on this ticket in case there are any puppet ch... 
[08:55:15] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ores1009.eqiad.wmnet [08:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:50] !log filippo@deploy1002 Started deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 [08:55:55] !log filippo@deploy1002 Finished deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 (duration: 00m 05s) [08:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:03] !log filippo@deploy1002 Started deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 [08:58:08] !log filippo@deploy1002 Finished deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 (duration: 00m 05s) [08:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:34] !log filippo@deploy1002 Started deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 [08:58:39] !log filippo@deploy1002 Finished deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 (duration: 00m 05s) [08:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:16] sorry for the spam :( [09:00:57] godog: it makes a nice change that it's not elukey for once [09:00:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1009.eqiad.wmnet [09:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15494 and previous config saved to /var/cache/conftool/dbconfig/20210421-090100-root.json [09:01:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:16] kormat: haha! ikr [09:03:56] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=mw2280.codfw.wmnet,service=nginx [09:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:35] kormat: I know that you have a special highlight on IRC for my username, so you can see my technically superb attempt to mess with production [09:04:39] *attempts [09:04:42] :D [09:05:32] !log filippo@deploy1002 Started deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 [09:05:37] !log filippo@deploy1002 Finished deploy [librenms/librenms@692b5d5]: Upgrade LibreNMS to 21.4.0 - T266987 (duration: 00m 05s) [09:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:52] !log upgrade hue on an-tool1009 to 4.9 [09:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:51] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:14:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Patch Set 1: Code-Review+1" [homer/public] - 10https://gerrit.wikimedia.org/r/681472 (https://phabricator.wikimedia.org/T238909) (owner: 10Alexandros Kosiaris) [09:14:10] (03CR) 10Muehlenhoff: "profile::mariadb:.dbstore_multiinstance uses profile::prometheus::mysqld_exporter_instance to setup the Promethesus exporter and that one " [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:15:55] (03Merged) 10jenkins-bot: Add kubernetes service IP ranges to prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/681472 (https://phabricator.wikimedia.org/T238909) (owner: 10Alexandros Kosiaris) 
[09:16:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15495 and previous config saved to /var/cache/conftool/dbconfig/20210421-091605-root.json [09:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:18] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 3 others: [Draft] New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) [09:20:44] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:22:53] (03PS1) 10Filippo Giunchedi: scap: stop deploy-local from deleting old revisions [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/681597 (https://phabricator.wikimedia.org/T266987) [09:26:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:21] checking --^ [09:27:34] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23642 bytes in 0.328 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:27:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [09:31:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15496 and previous config saved to /var/cache/conftool/dbconfig/20210421-093109-root.json [09:31:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:27] 10SRE, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10fgiunchedi) [09:33:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:34] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:35:51] (03CR) 10Jcrespo: "> Funny thing is I think it was the opposite- I just rebooted the host and the issues started after reboot." [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:36:45] (03CR) 10Jcrespo: "> I no longer see a service for s5 only s3 and s4:" [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:40:53] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications for db2139 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/681447 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:41:56] 10SRE, 10Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10MoritzMuehlenhoff) The upstream repo is https://github.com/googlefonts/noto-cjk We're currently running version 20170601, so this might be fixed in the 20201206-cjk release in Debian bullseye... [09:43:42] (03PS1) 10Phuedx: Rename RelatedArticlesFooterWhitelistedSkins to RelatedArticlesFooterAllowedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) [09:44:58] (03CR) 10Phuedx: [C: 04-2] "Blocked on I468a38df being deployed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) (owner: 10Phuedx) [09:45:04] (03PS3) 10Jcrespo: mariadb: Remove s3 from db2098 [puppet] - 10https://gerrit.wikimedia.org/r/681448 (https://phabricator.wikimedia.org/T280492) [09:45:29] (03CR) 10Jcrespo: "To be applied on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/681448 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [09:46:40] (03CR) 10Ladsgroup: [C: 03+1] "you're right. it was permission of the file." [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [09:47:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to the extent that this syntax can look good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/681299 (owner: 10David Caro) [09:49:31] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, and 2 others: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) And with the merge and deploy of the above we got: ` akosiaris@deploy1002:~$ kube_env proton staging-codfw akosiaris@... 
[09:52:49] 10SRE, 10vm-requests: Site: 2 VMs for failoid - https://phabricator.wikimedia.org/T280759 (10Volans) +1, LGTM [09:55:49] (03CR) 10Jbond: [C: 03+2] debmonitor: make cfssl the default PKI issuer for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/681353 (owner: 10Jbond) [09:56:06] !log switch debmonitor-clients to use cfssl [09:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:41] (03PS1) 10Jbond: pki: enable client [puppet] - 10https://gerrit.wikimedia.org/r/681600 [09:59:58] (03CR) 10Jbond: [V: 03+2 C: 03+2] pki: enable client [puppet] - 10https://gerrit.wikimedia.org/r/681600 (owner: 10Jbond) [10:01:59] (03PS19) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [10:02:01] (03PS8) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [10:02:03] (03PS5) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [10:03:48] (03PS20) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [10:04:26] (03PS21) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [10:04:45] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.03416 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:04:50] * jbond42 fixing [10:04:53] ^^ [10:06:56] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host failoid2002.codfw.wmnet [10:07:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:24] (03PS1) 10Jcrespo: mediabackups: Set a dummy password on cloud repo to check validity [labs/private] - 10https://gerrit.wikimedia.org/r/681602 (https://phabricator.wikimedia.org/T276442) [10:11:33] (03PS1) 10Jbond: Revert "debmonitor: make cfssl the default PKI issuer for debmonitor" [puppet] - 10https://gerrit.wikimedia.org/r/681383 [10:11:56] (03CR) 10Ladsgroup: [C: 03+1] "Fine by me. Letting Kunal take a look too." [puppet] - 10https://gerrit.wikimedia.org/r/681584 (https://phabricator.wikimedia.org/T278614) (owner: 10Jcrespo) [10:12:37] (03PS3) 10Jcrespo: mariadb: Add mailman3 and mailman3web to the list of hosts to be backed up [puppet] - 10https://gerrit.wikimedia.org/r/681584 (https://phabricator.wikimedia.org/T278614) [10:13:08] (03PS2) 10Jbond: Revert "debmonitor: make cfssl the default PKI issuer for debmonitor" [puppet] - 10https://gerrit.wikimedia.org/r/681383 [10:13:25] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "debmonitor: make cfssl the default PKI issuer for debmonitor" [puppet] - 10https://gerrit.wikimedia.org/r/681383 (owner: 10Jbond) [10:14:21] (03PS1) 10Jbond: debmonitor: make cfssl the default PKI issuer for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/681384 [10:15:06] (03PS2) 10Jcrespo: mediabackups: Set a dummy password on cloud repo to check validity [labs/private] - 10https://gerrit.wikimedia.org/r/681602 (https://phabricator.wikimedia.org/T276442) [10:17:34] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Ladsgroup) >>! In T278614#7023147, @jcrespo wrote: >>>! In T278614#7023003, @Ladsgroup wrote: >> oh and this one needs backups bu... [10:19:44] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10jcrespo) >>! 
In T278614#7023524, @Ladsgroup wrote: > `mailman3` should be small but `mailman3web` can get rather big. According t... [10:21:04] !log rebooting eventlog1002 for kernel update [10:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host failoid2002.codfw.wmnet [10:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:43] lots of puppet failures, maybe some ongoing package upgrade, I will check it goes back to close to 0 [10:22:19] oh, John already was working on it [10:22:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host eventlog1002.eqiad.wmnet [10:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:25] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host failoid1002.eqiad.wmnet [10:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:35] (03PS3) 10Jcrespo: mediabackups: Set a dummy password on cloud repo to check validity [labs/private] - 10https://gerrit.wikimedia.org/r/681602 (https://phabricator.wikimedia.org/T276442) [10:24:46] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mediabackups: Set a dummy password on cloud repo to check validity [labs/private] - 10https://gerrit.wikimedia.org/r/681602 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [10:25:43] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005303 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:27:02] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10aborrero) I think you can copy the configuration from the server this is replacing (cloudnet2003-dev). The dataplane interface for that server is... 
[10:29:18] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1002.eqiad.wmnet [10:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/681384 (owner: 10Jbond) [10:33:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host failoid1002.eqiad.wmnet [10:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:31] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/680814" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) (owner: 10Phuedx) [10:37:19] !log upload golang-cfssl packages for jessi and stretch [10:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:27] (03CR) 10Jbond: [C: 03+2] debmonitor: make cfssl the default PKI issuer for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/681384 (owner: 10Jbond) [10:39:58] (03PS22) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [10:40:22] (03PS1) 10Muehlenhoff: Add failoid1002/2002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/681608 [10:40:36] (03PS2) 10Muehlenhoff: Add failoid1002/2002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/681608 [10:41:39] !log switch debmonitor-client to cfssl (second try) [10:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:40] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10akosiaris) p:05Triage→03Medium [10:45:01] (03PS3) 10Muehlenhoff: Add failoid1002/2002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/681608 [10:45:29] 10SRE, 10Wikimedia-Mailing-lists: Create a 
mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (10akosiaris) 05Open→03Stalled p:05Triage→03Medium Stalling for a couple of weeks per above comments [10:46:20] 10SRE, 10Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10akosiaris) p:05Triage→03Low [10:47:58] (03CR) 10Muehlenhoff: [C: 03+2] Add failoid1002/2002 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/681608 (owner: 10Muehlenhoff) [10:48:59] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:50:43] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.1072 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:50:45] RECOVERY - ensure kvm processes are running on cloudvirt1045 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:51:12] (03PS1) 10Muehlenhoff: Update DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/681615 [10:53:43] (03CR) 10Muehlenhoff: [C: 03+2] Update DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/681615 (owner: 10Muehlenhoff) [10:54:55] 10SRE, 10Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10NFSL2001) I have changed the SVG to use Noto Sans instead of Source Han Sans first and is waiting for the preview image to load. https://commons.wikimedia.org/wiki/File:Periodic_table_zh-tw.s... 
[10:55:54] 10SRE, 10MediaWiki-General, 10Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10akosiaris) Is this just safari on iOS and Mac? This works for me (at least on 1 try) on: * Google Chrome logged in user on Linux... [10:56:19] 10SRE, 10MediaWiki-General, 10Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10akosiaris) p:05Triage→03Low Triaging as low until we can have an easy reproduction scenario. [10:59:06] (03PS1) 10Jbond: Revert "debmonitor: make cfssl the default PKI issuer for debmonitor" [puppet] - 10https://gerrit.wikimedia.org/r/681385 [10:59:40] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) @jbond: Prioritizing a task as "medium" priority but not being able to give an ans... [10:59:58] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10Aklapper) p:05Medium→03Triage [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T1100). [11:00:04] awight: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:31] It's just my own mess, I'll deploy :-) [11:00:46] (03CR) 10Jbond: [C: 03+2] Revert "debmonitor: make cfssl the default PKI issuer for debmonitor" [puppet] - 10https://gerrit.wikimedia.org/r/681385 (owner: 10Jbond) [11:01:03] (03CR) 10Awight: [C: 03+2] "Backport deployment" [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681334 (https://phabricator.wikimedia.org/T210106) (owner: 10Phuedx) [11:01:54] (03PS1) 10Jbond: Revert "Revert "debmonitor: make cfssl the default PKI issuer for debmonitor"" [puppet] - 10https://gerrit.wikimedia.org/r/681626 [11:02:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Revert "debmonitor: make cfssl the default PKI issuer for debmonitor"" [puppet] - 10https://gerrit.wikimedia.org/r/681626 (owner: 10Jbond) [11:02:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] exim: Drop support for legacy mailing list domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [11:02:17] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Figure out if we can remove legacy domain support for mailing lists - https://phabricator.wikimedia.org/T280472 (10akosiaris) p:05Triage→03Medium [11:02:51] (03PS3) 10Hnowlan: install_server: add entry for eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/681425 (https://phabricator.wikimedia.org/T280679) [11:04:23] (03CR) 10Hnowlan: [C: 03+2] install_server: add entry for eventlog1003 [puppet] - 10https://gerrit.wikimedia.org/r/681425 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [11:04:29] (03PS1) 10Jbond: P:debmonitor::client: switch back to using puppet as there are ACL blockin [puppet] - 10https://gerrit.wikimedia.org/r/681619 [11:04:33] 10SRE, 10Wikimedia-Mailing-lists: mail.wikimedia.org doesn't redirect to lists.wikimedia.org - https://phabricator.wikimedia.org/T280473 (10akosiaris) 05Open→03Resolved a:03akosiaris From what I gather, we are on board with removing it, so 
resolving this in favor of tracking the work in T280472. Feel fr... [11:04:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: payments1006.frack.eqiad.wmnet DRAC no console output - https://phabricator.wikimedia.org/T280527 (10akosiaris) p:05Triage→03High [11:04:55] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:debmonitor::client: switch back to using puppet as there are ACL blockin [puppet] - 10https://gerrit.wikimedia.org/r/681619 (owner: 10Jbond) [11:13:45] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005889 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:14:23] (03PS1) 10Jcrespo: dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681621 (https://phabricator.wikimedia.org/T280751) [11:14:25] (03PS1) 10Jcrespo: dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681622 (https://phabricator.wikimedia.org/T280751) [11:14:49] (03PS2) 10Jcrespo: dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681621 (https://phabricator.wikimedia.org/T280751) [11:14:53] (03PS2) 10Jcrespo: dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681622 (https://phabricator.wikimedia.org/T280751) [11:15:13] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10akosiaris) p:05Triage→03Low >>! In T280622#7019996, @jbond wrote: > FYI i ran puppet fleet wide today using a batch size of 40 and there was no issue. puppet master load rose from ~1.5 ->... 
[11:15:19] (03PS3) 10Jcrespo: dbbackups: Switchover s6 eqiad database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681622 (https://phabricator.wikimedia.org/T280751) [11:16:35] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [11:16:39] (03PS4) 10Jcrespo: dbbackups: Switchover s6 eqiad database backups from db1139 to db1140 [puppet] - 10https://gerrit.wikimedia.org/r/681622 (https://phabricator.wikimedia.org/T280751) [11:16:48] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10akosiaris) p:05Triage→03Medium [11:17:09] 10SRE, 10ops-eqiad: Can't access thanos-fe1001.mgmt - https://phabricator.wikimedia.org/T280623 (10akosiaris) p:05Triage→03Medium [11:17:51] (03CR) 10Ladsgroup: [C: 03+1] exim: Drop support for legacy mailing list domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [11:17:56] (cross-post): can anyone here restart the CI infra? It's gummed up... [11:18:24] (03CR) 10Kosta Harlan: [C: 03+1] Update $wgGEHomepageNewAccountVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681309 (https://phabricator.wikimedia.org/T278123) (owner: 10Gergő Tisza) [11:19:31] 10SRE, 10Machine-Learning-Team, 10serviceops: Kubernetes packages in Debian Bullseye - https://phabricator.wikimedia.org/T280625 (10akosiaris) 05Open→03Declined I am gonna tentatively for now decline this. While #serviceops wouldn't block #machine-learning-team if they wanted to utilize bullseye debian p... 
[11:20:00] (03Merged) 10jenkins-bot: Send "0 edits" userEditCountBucket for anons [extensions/WikimediaEvents] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681334 (https://phabricator.wikimedia.org/T210106) (owner: 10Phuedx) [11:20:14] It's running again. [11:21:18] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 3 others: [Draft] New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10akosiaris) p:05Triage→03Medium [11:29:09] !log awight@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/WikimediaEvents: Backport: [[gerrit:681334|Send 0 edits userEditCountBucket for anons (T210106)]] (duration: 00m 59s) [11:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:18] T210106: Provide a reusable getEditCountBucket function for analytics purposes - https://phabricator.wikimedia.org/T210106 [11:30:22] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) [11:30:35] (03PS1) 10Ladsgroup: rsyslog: Add mailman3 to list of accepted daemons [puppet] - 10https://gerrit.wikimedia.org/r/681648 (https://phabricator.wikimedia.org/T276697) [11:31:17] 10SRE, 10Wikimedia-Mailing-lists, 10observability, 10Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10Ladsgroup) This is for mailman3 process. I haven't added mailman3web yet. 
[11:31:30] !log installing failoid1002 [11:31:35] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/681648 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [11:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:15] !log EU backport window complete [11:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:30] 10SRE, 10Wikimedia-Mailing-lists, 10observability, 10Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10Ladsgroup) PCC is noop: https://puppet-compiler.wmflabs.org/compiler1001/727/lists1002.wikimedia.org/index.html 🤔 [11:39:11] 10SRE, 10MediaWiki-General, 10Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Daimona) >>! In T280439#7023677, @akosiaris wrote: > Is this just safari on iOS and Mac? This works for me (at least on 1 try) on... 
[11:39:18] (03PS1) 10Jbond: PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 [11:44:49] (03PS1) 10Hnowlan: site: set role for eventlog1003 to eventlog [puppet] - 10https://gerrit.wikimedia.org/r/681652 (https://phabricator.wikimedia.org/T280679) [11:46:08] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [11:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:58] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 3 others: [Draft] New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) [11:56:12] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 3 others: New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) [11:57:58] 10SRE, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM %request for eventlog - https://phabricator.wikimedia.org/T280679 (10hnowlan) 05Open→03Resolved [11:58:22] (03PS1) 10Marostegui: install_server: Reimage db1158 to buster [puppet] - 10https://gerrit.wikimedia.org/r/681653 (https://phabricator.wikimedia.org/T258361) [11:58:27] PROBLEM - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:38] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10jbond) >>! In T228591#7023681, @Aklapper wrote: > @jbond: Prioritizing a task as "medium" pr... 
[11:59:19] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1158 to buster [puppet] - 10https://gerrit.wikimedia.org/r/681653 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [12:01:32] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1158.eqiad.wmnet'] ` The log ca... [12:05:18] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] scap: stop deploy-local from deleting old revisions [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/681597 (https://phabricator.wikimedia.org/T266987) (owner: 10Filippo Giunchedi) [12:05:49] (03PS2) 10Jbond: PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 [12:07:51] (03PS3) 10Jbond: PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 [12:08:50] (03PS4) 10Jbond: PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 [12:10:07] (03PS5) 10Jbond: PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 [12:10:55] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [12:11:16] (03CR) 10Ayounsi: [C: 03+1] PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 (owner: 10Jbond) [12:12:56] I'm looking into thanos-fe2001 [12:14:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE [12:14:36] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:30] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 3 others: New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) [12:16:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE [12:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:24] RECOVERY - Host thanos-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [12:20:45] 10ops-codfw: thanos-fe2001 machine check exception and crash/stall - https://phabricator.wikimedia.org/T280782 (10fgiunchedi) [12:21:27] !log installing failoid2002 [12:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:28] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [12:25:09] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1158.eqiad.wmnet'] ` and were **ALL** successful. [12:30:21] (03CR) 10Jbond: [C: 03+2] PKI access: open access to the pki service for analytics and cloud [homer/public] - 10https://gerrit.wikimedia.org/r/681650 (owner: 10Jbond) [12:33:23] 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10akosiaris) So a successful fetch per safari, of 100 bytes per `Content-Length`. Interestingly, my tests are almost i... 
[12:34:18] 10SRE, 10Machine-Learning-Team, 10serviceops: Kubernetes packages in Debian Bullseye - https://phabricator.wikimedia.org/T280625 (10elukey) @akosiaris to be clear, I didn't mean that ML would have used it to bypass Service ops, it was only to bring up the subject to get opinions, +1 to decline it after what... [12:34:28] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.497e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:34:57] 10SRE, 10Wikimedia-Mailing-lists: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (10akosiaris) p:05Triage→03Medium [12:38:06] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01346 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:39:48] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) p:05Triage→03Low [12:41:46] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) 3 commits * 566424388816a0094cf1f2720f5855ca38230562 * 1616462ae03ad28fff9fdfc85203820c0bc25c1e * b52d6... 
[12:43:11] (03PS1) 10Jbond: CR:firewall: remove tangeling term [homer/public] - 10https://gerrit.wikimedia.org/r/681664 [12:44:40] (03CR) 10Ayounsi: [C: 03+1] CR:firewall: remove tangeling term [homer/public] - 10https://gerrit.wikimedia.org/r/681664 (owner: 10Jbond) [12:44:48] (03CR) 10Jbond: [C: 03+2] CR:firewall: remove tangeling term [homer/public] - 10https://gerrit.wikimedia.org/r/681664 (owner: 10Jbond) [12:45:10] (03PS1) 10Alexandros Kosiaris: Remove fc-list file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 [12:45:59] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (10akosiaris) p:05Triage→03Medium [12:48:54] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:50:32] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:51:57] (03PS1) 10Jbond: debmonitor:client: switch ssl to pki [puppet] - 10https://gerrit.wikimedia.org/r/681667 [12:54:36] (03CR) 10Jbond: [C: 03+2] debmonitor:client: switch ssl to pki [puppet] - 10https://gerrit.wikimedia.org/r/681667 (owner: 10Jbond) [12:55:15] (03PS1) 10Jbond: Revert "debmonitor:client: switch ssl to pki" [puppet] - 10https://gerrit.wikimedia.org/r/681630 [12:55:59] (03PS1) 10Gergő Tisza: Job queue configuration for DeleteLinkRecommendationJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/681669 [13:01:57] !log upgrading mw1262-1265,mw1277-1279 to PHP 7.2.34 [13:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:22] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:07:45] (03CR) 10Elukey: "Hugh: can we update the pcc facts and run PCC for eventlog100[2,3] to see what changes?" [puppet] - 10https://gerrit.wikimedia.org/r/681652 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [13:09:18] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:09:37] (03PS1) 10Urbanecm: eswiki: Push Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681671 (https://phabricator.wikimedia.org/T278235) [13:16:10] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:16:58] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01176 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:18:05] (03PS5) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [13:18:53] !log [urbanecm@mwmaint1002 ~]$ time mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=frwiki # T279853 [13:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:02] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [13:19:43] (03CR) 10Ottomata: alerts: add victorops paging for hadoop master and kafka broker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681420 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [13:22:20] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 195322016 and 17 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:24:21] (03PS1) 10Effie Mouzeli: (WIP) conftool: improve safe-service-restart multiple cluster support [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) [13:25:35] (03CR) 10jerkins-bot: [V: 04-1] (WIP) conftool: improve safe-service-restart multiple cluster support [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) (owner: 10Effie Mouzeli) [13:26:50] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1286008 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:33:08] 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10ema) Thanks for pointing me to this task @akosiaris. The issue here is webp, this is broken on Safari: https://uploa... [13:36:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) The secondary IPs are set by puppet and are on a subnet that's separate from anything managed by n... [13:39:45] !log upgrading mw1262-1265,mw1277-1279 to PHP 7.2.34 [13:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:58] (03CR) 10Jbond: "Glad you found `profile::pki::get_cert` :) have left some comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [13:41:42] (03PS6) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [13:42:07] (03CR) 10Majavah: "Thanks!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [13:44:36] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:44:38] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:06] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:11] (03PS1) 10Andrew Bogott: Make cloudcephosd1016 an osd node [puppet] - 10https://gerrit.wikimedia.org/r/681681 (https://phabricator.wikimedia.org/T274945) [13:45:32] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [13:47:15] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcephosd1016 an osd node [puppet] - 10https://gerrit.wikimedia.org/r/681681 (https://phabricator.wikimedia.org/T274945) (owner: 10Andrew Bogott) [13:54:51] !log [urbanecm@mwmaint1002 ~]$ time mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=fawiki # T279853 [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:02] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [13:59:54] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on an-coord1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, ssbd, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode [14:00:11] 
(03PS1) 10Filippo Giunchedi: hieradata: trim prometheus retention on PoPs [puppet] - 10https://gerrit.wikimedia.org/r/681684 (https://phabricator.wikimedia.org/T277163) [14:00:57] (03PS1) 10Ema: cache: do not serve webp files to Safari [puppet] - 10https://gerrit.wikimedia.org/r/681685 (https://phabricator.wikimedia.org/T280439) [14:01:36] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005882 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:10:25] (03CR) 10Jbond: "see inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [14:11:26] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005294 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:11:41] 10SRE, 10Wikimedia-Mailing-lists, 10observability, 10Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10fgiunchedi) Bizarre PCC is a NOOP indeed. The patch LGTM, but I see `mailman3` didn't log anything to journald on lists1002 since this morning? [14:13:12] 10SRE, 10Wikimedia-Mailing-lists, 10observability, 10Patch-For-Review: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10Ladsgroup) it's a test system atm so not super active right now. I can send some mails to trigger some logs if you want to. [14:13:59] Amir1: sure, a test email would work [14:14:23] afaics logs are (also?) in /var/log/mailman3 [14:14:36] godog: done [14:15:07] yeah, I think that's the canonical place [14:15:13] yeah I don't think mailman3 logs to journald indeed [14:15:17] exim4 is similar [14:15:27] :((((((( [14:15:42] okay. 
I'll use the other point you said [14:15:44] (03PS1) 10Jbond: P:pki::get_cert: escape unsafe labels [puppet] - 10https://gerrit.wikimedia.org/r/681688 [14:16:58] (03PS1) 10MSantos: no-op build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 [14:17:01] (03CR) 10Jbond: [C: 03+2] P:pki::get_cert: escape unsafe labels [puppet] - 10https://gerrit.wikimedia.org/r/681688 (owner: 10Jbond) [14:17:52] (03CR) 10Jbond: etcd: Use cfssl for peer-to-peer communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [14:18:01] (03CR) 10jerkins-bot: [V: 04-1] no-op build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [14:18:15] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10Papaul) 05Open→03Resolved @aborrero this complete ` [edit interfaces] + ge-1/0/27 { + description cloudnet2004-dev; + mtu 919... 
[14:22:33] (03PS9) 10Volans: clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [14:22:35] (03PS1) 10Volans: tests: fix integration tests [software/cumin] - 10https://gerrit.wikimedia.org/r/681690 [14:22:37] (03PS1) 10Volans: tests: fix minimum dependency and pytest warning [software/cumin] - 10https://gerrit.wikimedia.org/r/681691 [14:22:40] (03PS1) 10Volans: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) [14:23:49] (03CR) 10Volans: [C: 03+2] setup.py: revert tqdm upper limit constraint [software/cumin] - 10https://gerrit.wikimedia.org/r/681427 (owner: 10Volans) [14:25:20] (03PS2) 10MSantos: build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 [14:25:28] !log upload new version of debmonitor-client to apt [14:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:02] (03CR) 10jerkins-bot: [V: 04-1] build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [14:28:09] (03PS2) 10Volans: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) [14:29:49] (03PS1) 10David Caro: icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) [14:30:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15500 and previous config saved to /var/cache/conftool/dbconfig/20210421-143015-root.json [14:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:06] (03PS1) 
10Muehlenhoff: Enable failoid role for failoid1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/681696 [14:32:11] (03PS1) 10Elukey: Add user 'yarn' among the admins in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/681698 (https://phabricator.wikimedia.org/T277062) [14:32:23] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [14:34:42] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10RKemper) [14:35:40] (03CR) 10Elukey: [C: 03+2] Add user 'yarn' among the admins in Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/681698 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:35:42] (03Merged) 10jenkins-bot: setup.py: revert tqdm upper limit constraint [software/cumin] - 10https://gerrit.wikimedia.org/r/681427 (owner: 10Volans) [14:36:13] 10SRE, 10Wikidata, 10wdwb-tech, 10wikiba.se website, 10HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10Ladsgroup) [14:36:16] (03PS7) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [14:36:30] (03CR) 10Volans: "Replies to my last comments" (035 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [14:36:56] (03CR) 10Majavah: etcd: Use cfssl for peer-to-peer communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [14:37:22] (03CR) 10jerkins-bot: [V: 04-1] icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [14:40:00] (03PS1) 10Elukey: Enable the Yarn Capacity scheduler for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) 
[14:40:17] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10RKemper) [14:41:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29140/console" [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:41:48] 10SRE, 10DBA, 10Privacy Engineering, 10WMF-Legal, and 3 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10Ladsgroup) [14:43:11] (03PS3) 10MSantos: build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 [14:43:52] (03CR) 10jerkins-bot: [V: 04-1] build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [14:44:20] (03PS2) 10Elukey: Enable the Yarn Capacity scheduler for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) [14:45:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29141/console" [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15501 and previous config saved to /var/cache/conftool/dbconfig/20210421-144519-root.json [14:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:56] (03CR) 10Elukey: [V: 03+1] "Ready to deploy folks :)" [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:53:14] (03PS2) 10Hnowlan: site: set role for eventlog1003 to eventlog [puppet] - 10https://gerrit.wikimedia.org/r/681652 
(https://phabricator.wikimedia.org/T280679) [14:54:22] (03CR) 10Jforrester: "Dupe of I34519c75ba?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681598 (https://phabricator.wikimedia.org/T277958) (owner: 10Phuedx) [14:56:10] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:57:52] (03PS2) 10David Caro: icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) [14:58:00] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:00:02] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10RKemper) [15:00:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15502 and previous config saved to /var/cache/conftool/dbconfig/20210421-150023-root.json [15:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:08] !log installing jquery security updates on buster [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:55] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/681684 (https://phabricator.wikimedia.org/T277163) (owner: 10Filippo Giunchedi) [15:04:04] (03CR) 10jerkins-bot: [V: 04-1] icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [15:06:59] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29145/console" [puppet] - 10https://gerrit.wikimedia.org/r/681684 (https://phabricator.wikimedia.org/T277163) (owner: 10Filippo Giunchedi) [15:07:10] (03PS4) 10MSantos: build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 [15:07:52] (03CR) 10jerkins-bot: [V: 04-1] build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [15:09:40] (03PS2) 10Filippo Giunchedi: hieradata: trim prometheus retention on PoPs [puppet] - 10https://gerrit.wikimedia.org/r/681684 (https://phabricator.wikimedia.org/T277163) [15:09:52] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] hieradata: trim prometheus retention on PoPs [puppet] - 10https://gerrit.wikimedia.org/r/681684 (https://phabricator.wikimedia.org/T277163) (owner: 10Filippo Giunchedi) [15:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15503 and previous config saved to /var/cache/conftool/dbconfig/20210421-151526-root.json [15:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:38] !log urbanecm@mwmaint1002:~$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php # T279853 [15:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:46] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [15:16:51] (03PS1) 10Hnowlan: site: set eventlog1003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/681704 (https://phabricator.wikimedia.org/T280679) [15:18:17] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [15:26:42] PROBLEM - Check for expired 
certificates debmonitor_discovery_wmnet on pki2001 is CRITICAL: CRITICAL - 1 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [15:26:52] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is CRITICAL: CRITICAL - 1 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [15:31:16] ^ what's up with that? I checked the "live" cert I get from debmonitor.discovery.wmnet from codfw+eqiad and they're both 2019-2024 [15:31:28] RECOVERY - Check for expired certificates debmonitor_discovery_wmnet on pki2001 is OK: OK - No certificates due to expire https://wikitech.wikimedia.org/wiki/PKI/Debugging [15:31:42] RECOVERY - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is OK: OK - No certificates due to expire https://wikitech.wikimedia.org/wiki/PKI/Debugging [15:33:22] cc jbond42 as he was working on debmonitor certs today [15:35:04] volans: bblack: thanks, those can safely be ignored; the timing of when the cert is renewed and the icinga checks have a small amount of crossover, which needs fixing [15:35:12] have it for discussion in a meeting later [15:36:52] (03PS1) 10Jbond: cfssl: updated cfssl-certs script to add a list function [puppet] - 10https://gerrit.wikimedia.org/r/681706 [15:37:32] (03CR) 10jerkins-bot: [V: 04-1] Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:39:14] (03PS5) 10MSantos: build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 [15:39:27] (03PS11) 10Ottomata: Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [15:39:32] (03PS10) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 
10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [15:42:05] (03CR) 10Jbond: [C: 03+2] cfssl: updated cfssl-certs script to add a list function [puppet] - 10https://gerrit.wikimedia.org/r/681706 (owner: 10Jbond) [15:44:21] (03PS12) 10Ottomata: Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [15:45:19] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [15:46:16] (03CR) 10MSantos: [C: 03+2] build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [15:47:58] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/681696 (owner: 10Muehlenhoff) [15:48:05] (03Merged) 10jenkins-bot: build: ec1adafabb5c9fe9e5614ab5b5e1ac47c28b47aa [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [15:52:42] (03PS3) 10David Caro: icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) [15:53:29] (03PS13) 10Ottomata: Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) [15:53:52] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/29150/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:54:21] !log T280744: legoktm@lists1001:~$ sudo chmod 644 /etc/aliases [15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:31] T280744: Mail to root@lists1001.wikimedia.org doesn't work 
because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 [15:55:06] 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari, 10Patch-For-Review: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10akosiaris) wow, TIL. Thanks for that hint @ema. [15:55:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] cache: do not serve webp files to Safari [puppet] - 10https://gerrit.wikimedia.org/r/681685 (https://phabricator.wikimedia.org/T280439) (owner: 10Ema) [15:57:20] 10SRE, 10SRE-Access-Requests: Need access to noc@wikimedia.org (associated with Analytics' MaxMind account) - https://phabricator.wikimedia.org/T279310 (10akosiaris) 05Open→03Resolved @JLaytonWMF I am gonna tentatively resolve this task, it looks like the matter is out of SRE hands and maxmind should be co... [15:57:59] (03CR) 10MSantos: "recheck publish" [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/681689 (owner: 10MSantos) [15:58:20] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10akosiaris) Any news on this? [15:59:10] (03CR) 10jerkins-bot: [V: 04-1] icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [16:01:42] 10SRE, 10Mail: Mail to root@lists1001.wikimedia.org doesn't work because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 (10Legoktm) Now we're at: ` 2021-04-21 15:56:55 H=localhost (lists1001.wikimedia.org) [::1]:57120 I=[::1]:25 sender verify fail for 10SRE, 10Machine-Learning-Team, 10serviceops: Kubernetes packages in Debian Bullseye - https://phabricator.wikimedia.org/T280625 (10akosiaris) >>! 
In T280625#7023947, @elukey wrote: > @akosiaris to be clear, I didn't mean that ML would have used it to bypass Service ops, it was only to bring up the subject t... [16:05:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Andrew) >>! In T274945#7024178, @Andrew wrote: > The secondary IPs are set by puppet and are on a subnet t... [16:05:44] (03PS1) 10Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - 10https://gerrit.wikimedia.org/r/681708 [16:06:45] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10akosiaris) Hi @Reedy, given the discussion in the task, do you reckon you still need racktables access? Or should be close this as instead? [16:06:49] (03PS1) 10Cparle: Make the logistic regression image search default [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681709 (https://phabricator.wikimedia.org/T271799) [16:08:30] (03CR) 10jerkins-bot: [V: 04-1] Use Capirca to generate mgmt SRX security policies [homer/public] - 10https://gerrit.wikimedia.org/r/681708 (owner: 10Ayounsi) [16:09:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Adding the 3 people that have a commit for this file in the last 8 years." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [16:09:39] (03PS2) 10Alexandros Kosiaris: Remove fc-list file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 [16:10:37] (03CR) 10Reedy: [C: 04-1] "Would want removing from docroot/noc/createTxtFileSymlinks.sh too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [16:10:47] (03CR) 10Reedy: [C: 04-1] "And docroot/noc/conf/index.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [16:13:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: add Manuel Merz to ldap_only admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/681202 (https://phabricator.wikimedia.org/T280162) (owner: 10Dzahn) [16:14:08] (03CR) 10Alexandros Kosiaris: "Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/681202 (https://phabricator.wikimedia.org/T280162) (owner: 10Dzahn) [16:15:27] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10akosiaris) 05Open→03Resolved a:03akosiaris User added to wmde and nda ldap groups. @Manuel, I am resolving this task, feel free to reopen if any issues wi... [16:16:26] (03PS2) 10Alexandros Kosiaris: admin: add mlitn to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/681167 (https://phabricator.wikimedia.org/T274749) (owner: 10Dzahn) [16:18:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: add mlitn to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/681167 (https://phabricator.wikimedia.org/T274749) (owner: 10Dzahn) [16:21:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10akosiaris) 05Open→03Resolved Thanks @Dzahn! @matthiasmullie access has been granted. 
It will take ~30 minutes to fully propagate but otherwise, on our end you are go... [16:26:48] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10akosiaris) p:05Triage→03Low Specifically regarding https://noc.wikimedia.org/conf/fc-li... [16:27:21] (03PS3) 10Alexandros Kosiaris: Remove fc-list file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 [16:27:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [16:41:33] 10SRE, 10ops-codfw: thanos-fe2001 machine check exception and crash/stall - https://phabricator.wikimedia.org/T280782 (10akosiaris) p:05Triage→03High [16:41:51] 10SRE, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T280668 (10akosiaris) p:05Triage→03High [16:45:05] (03CR) 10SBassett: Unbreak dbtree (031 comment) [software] - 10https://gerrit.wikimedia.org/r/192771 (owner: 10Springle) [16:50:48] PROBLEM - puppet last run on mw2280 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:55:31] (03PS2) 10Effie Mouzeli: (WIP) conftool: improve safe-service-restart multiple cluster support [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) [16:57:03] (03CR) 10jerkins-bot: [V: 04-1] (WIP) conftool: improve safe-service-restart multiple cluster support [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) (owner: 10Effie Mouzeli) [16:58:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ayounsi) a:05ayounsi→03RobH Shouldn't have they been racked 
in the cloud racks (C8 and `D5`) to be con... [16:59:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) >>! In T274945#7025033, @ayounsi wrote: > Shouldn't have they been racked in the cloud racks (C8 and... [17:01:47] (03PS3) 10Effie Mouzeli: (WIP) conftool: improve safe-service-restart multiple cluster support [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) [17:02:10] (03CR) 10Effie Mouzeli: [C: 04-2] "This won't work for api, needs more work" [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) (owner: 10Effie Mouzeli) [17:03:28] RECOVERY - puppet last run on mw2280 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:09:00] (03PS4) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) [17:09:02] (03PS1) 10Legoktm: exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) [17:09:05] (03CR) 10Legoktm: exim: Drop support for legacy mailing list domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [17:09:38] (03CR) 10jerkins-bot: [V: 04-1] exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [17:09:58] (03CR) 10jerkins-bot: [V: 04-1] exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [17:10:17] (03PS5) 10Legoktm: exim: Drop support for 
legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) [17:10:19] (03PS2) 10Legoktm: exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) [17:11:47] (03CR) 10jerkins-bot: [V: 04-1] exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [17:12:54] (03PS1) 10Arturo Borrero Gonzalez: toolforge: nginx-ingress-jobs: specify ingress-class [puppet] - 10https://gerrit.wikimedia.org/r/681725 (https://phabricator.wikimedia.org/T251917) [17:14:51] (03PS6) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) [17:14:53] (03PS3) 10Legoktm: exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) [17:17:35] (03PS5) 10Legoktm: exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [17:17:52] (03PS1) 10Urbanecm: Set wgGEMentorshipMigrationStage to WRITE_BOTH/READ_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681750 (https://phabricator.wikimedia.org/T279853) [17:18:36] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29151/console" [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [17:19:15] (03PS6) 10Legoktm: exim: Fix TLS cert for mailman in cloud [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [17:19:25] (03CR) 10Legoktm: [C: 03+2] exim: Fix TLS cert for mailman in cloud [puppet] - 
10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [17:20:29] jouncebot: next [17:20:30] In 0 hour(s) and 39 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T1800) [17:20:30] In 0 hour(s) and 39 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T1800) [17:31:04] (03PS1) 10Legoktm: mariadb: Allow lists1001.wikimedia.org to talk to m5 [puppet] - 10https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) [17:32:53] (03CR) 10Ladsgroup: [C: 03+1] "oh thank you." [puppet] - 10https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) (owner: 10Legoktm) [17:33:55] (03PS4) 10David Caro: icinga: use a bash command wrapper to allow sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) [17:33:57] (03PS1) 10David Caro: icinga: use a sudo-friendly command to get command_file [software/spicerack] - 10https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641) [17:34:06] (03CR) 10Legoktm: [C: 03+1] mariadb: Add mailman3 and mailman3web to the list of hosts to be backed up [puppet] - 10https://gerrit.wikimedia.org/r/681584 (https://phabricator.wikimedia.org/T278614) (owner: 10Jcrespo) [17:35:33] (03CR) 10Jcrespo: "Note that while I helped you for backups- the right person for reviewing misc db patches is either Manuel or Stevie, as they will know pot" [puppet] - 10https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) (owner: 10Legoktm) [17:37:07] (03CR) 10Legoktm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) (owner: 10Legoktm) [17:38:23] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add mailman3 and mailman3web to the list of hosts to be backed up [puppet] - 10https://gerrit.wikimedia.org/r/681584 
(https://phabricator.wikimedia.org/T278614) (owner: 10Jcrespo) [17:39:25] ACKNOWLEDGEMENT - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service Legoktm T280744 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:05] (03PS2) 10David Caro: icinga: use a sudo-friendly command to get command_file [software/spicerack] - 10https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641) [17:40:44] (03CR) 10David Caro: "I'm having lots of issues trying to run the tests with tox locally... it ends up in the dependency loop (taking >2h so far), so sorry for " [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [17:41:31] (03CR) 10jerkins-bot: [V: 04-1] icinga: use a sudo-friendly command to get command_file [software/spicerack] - 10https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [17:41:38] (03CR) 10Volans: "> Patch Set 3:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [17:42:14] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [17:43:00] !log deploy grant changes on m5 backup sources (db1117 and db2078) T278614 [17:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:13] T278614: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 [17:47:47] (03PS1) 10Legoktm: lists: Stage mailman3 configuration, but don't enable yet [puppet] - 10https://gerrit.wikimedia.org/r/681755 (https://phabricator.wikimedia.org/T278610) [17:48:23] (03CR) 10jerkins-bot: [V: 04-1] icinga: use a sudo-friendly command to get command_file [software/spicerack] - 10https://gerrit.wikimedia.org/r/681754 
(https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [17:49:27] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10jcrespo) Backups have been enabled and access seem correct. I saw the dbs are right now empty, but please ping me at some point i... [17:49:58] (03PS1) 10Legoktm: lists: Add mailman3 config [labs/private] - 10https://gerrit.wikimedia.org/r/681757 [17:53:55] 10SRE, 10Mail: Mail to root@lists1001.wikimedia.org doesn't work because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 (10Ladsgroup) From what I can see, it means the router rules has not been matched(!) The router rule is exactly the same as the exim4 smarthost config so I'm che... [17:55:57] hello! would someone with the relevant privileges/permissions mind approving (+2) this beta-only config change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681456/ ? [17:56:03] (03CR) 10Kaldari: [C: 04-1] "Yes, it is purely informational, but it is very important information that the community relies on. Instead of deleting it, it needs to be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [17:59:48] (03PS2) 10Legoktm: lists: Add mailman3 config [labs/private] - 10https://gerrit.wikimedia.org/r/681757 [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T1800). [18:00:05] tgr, cormacparle, and Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T1800) [18:00:15] i can deploy today [18:00:27] cool, I'm here anyway [18:00:27] cjming: I can do it. [18:00:39] ty! [18:00:59] (03CR) 10Urbanecm: [C: 03+2] Make the logistic regression image search default [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681709 (https://phabricator.wikimedia.org/T271799) (owner: 10Cparle) [18:01:15] (03CR) 10Urbanecm: [C: 03+2] Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) (owner: 10Clare Ming) [18:01:17] (03PS1) 10Volans: setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 [18:02:02] (03Merged) 10jenkins-bot: Update wgVectorLanguageInHeader variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681456 (https://phabricator.wikimedia.org/T277588) (owner: 10Clare Ming) [18:02:04] tgr|away: you around? [18:02:16] cjming: should be merged now. Keep in mind it will take up to 30 minutes to propagate. 
[18:02:25] (03PS2) 10Hnowlan: api-gateway: use envoy 1.15.4 temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) [18:02:34] (03CR) 10Urbanecm: [C: 03+2] eswiki: Push Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681671 (https://phabricator.wikimedia.org/T278235) (owner: 10Urbanecm) [18:03:52] (03CR) 10Ladsgroup: [C: 03+1] lists: Stage mailman3 configuration, but don't enable yet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681755 (https://phabricator.wikimedia.org/T278610) (owner: 10Legoktm) [18:03:55] o/ [18:03:58] (03Merged) 10jenkins-bot: eswiki: Push Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681671 (https://phabricator.wikimedia.org/T278235) (owner: 10Urbanecm) [18:04:26] (03CR) 10Hnowlan: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) (owner: 10Hnowlan) [18:04:32] hi tgr_ ! [18:05:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e252de0482c60e87e06d866006bb9ceb186af6cf: eswiki: Push Growth features out of dark mode (T278235) (duration: 01m 00s) [18:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:02] T278235: Deploy Growth features on Spanish Wikipedia - https://phabricator.wikimedia.org/T278235 [18:06:07] (03CR) 10Legoktm: [V: 03+2 C: 03+2] lists: Add mailman3 config [labs/private] - 10https://gerrit.wikimedia.org/r/681757 (owner: 10Legoktm) [18:06:18] 10SRE: Expose live font list (fc-list) on a public webpage - https://phabricator.wikimedia.org/T280829 (10kaldari) [18:06:20] Urbanecm: much obliged [18:06:36] any time :) [18:06:50] (03CR) 10Kaldari: [C: 04-1] "I filed a task for a live font list at https://phabricator.wikimedia.org/T280829." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [18:06:59] (03PS2) 10Urbanecm: Set wgGEMentorshipMigrationStage to WRITE_BOTH/READ_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681750 (https://phabricator.wikimedia.org/T279853) [18:07:03] (03CR) 10Urbanecm: [C: 03+2] Set wgGEMentorshipMigrationStage to WRITE_BOTH/READ_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681750 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [18:07:42] (03CR) 10Legoktm: [C: 03+2] lists: Stage mailman3 configuration, but don't enable yet [puppet] - 10https://gerrit.wikimedia.org/r/681755 (https://phabricator.wikimedia.org/T278610) (owner: 10Legoktm) [18:08:33] 10SRE: update svg font list - https://phabricator.wikimedia.org/T79424 (10kaldari) 05Resolved→03Open Reopening as the list is again out of date. See also T280829. [18:09:10] (03CR) 10Kaldari: [C: 04-1] "I also reopened https://phabricator.wikimedia.org/T79424, which is for updating the file in the meantime." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [18:10:40] (03Merged) 10jenkins-bot: Set wgGEMentorshipMigrationStage to WRITE_BOTH/READ_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681750 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [18:12:01] (03PS2) 10Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - 10https://gerrit.wikimedia.org/r/681432 [18:14:05] (03PS2) 10Ayounsi: Use Capirca to generate mgmt SRX security policies [homer/public] - 10https://gerrit.wikimedia.org/r/681708 [18:14:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1ae5ca5467fad7bfdae8aa94b241fe6c048ab8e5: Set wgGEMentorshipMigrationStage to WRITE_BOTH/READ_NEW everywhere (T279853) (duration: 00m 59s) [18:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:18] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [18:14:20] tgr_: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681309 is marked as "depends on" a patch that got merged, but is not yet in production. Can you confirm it's okay to go ahead? [18:15:12] yeah, it's a soft dependency, I'll check on mwdebug to make sure nothing unexpected happens [18:15:47] okay, cool [18:15:53] (03PS3) 10Urbanecm: Update $wgGEHomepageNewAccountVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681309 (https://phabricator.wikimedia.org/T278123) (owner: 10Gergő Tisza) [18:15:57] (03CR) 10Urbanecm: [C: 03+2] Update $wgGEHomepageNewAccountVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681309 (https://phabricator.wikimedia.org/T278123) (owner: 10Gergő Tisza) [18:21:58] 10SRE: Expose live font list (fc-list) on a public webpage - https://phabricator.wikimedia.org/T280829 (10Dzahn) ACK! It's kind of a duplicate of T280718 before that was renamed at least. 
[18:25:04] 10SRE: update svg font list - https://phabricator.wikimedia.org/T79424 (10Dzahn) I posted the current list on T210960#7015971 by request: {P15475} [18:25:49] jenkins is so slow today... [18:26:07] 10+ minutes to merge a config patch, very slow [18:26:12] :( [18:27:10] extension patches take upwards from an hour [18:27:30] yikes [18:27:31] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Dzahn) Also see T280829. And I posted the current version at /T210960#7015971 the other day, by request, as it hap... [18:29:07] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Dzahn) >>! In T280718#7023966, @akosiaris wrote: > in 8 years says to me that this isn't something sustainable to k... [18:29:31] (03Merged) 10jenkins-bot: Update $wgGEHomepageNewAccountVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681309 (https://phabricator.wikimedia.org/T278123) (owner: 10Gergő Tisza) [18:29:34] finally [18:29:58] tgr_: pulled onto mwdebug1001, can you test? 
[18:30:06] (03Merged) 10jenkins-bot: Make the logistic regression image search default [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681709 (https://phabricator.wikimedia.org/T271799) (owner: 10Cparle) [18:30:23] looking [18:31:00] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:34:41] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10JoKalliauer) @akosiaris I think it is important to define "safe fonts" that are available to librsvg for commons (... [18:35:03] cormacparle: your patch is available at mwdebug1002, can you test, please? [18:35:08] on it [18:35:40] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 23634 bytes in 0.351 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:37:07] Urbanecm: looks good [18:37:11] thanks, syncing [18:38:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f6d076a69607172475a86ba935a273e7519108d1: Update $wgGEHomepageNewAccountVariants (T278123) (duration: 00m 58s) [18:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:59] T278123: Provide capability for A/B testing task types - https://phabricator.wikimedia.org/T278123 [18:39:33] tgr_: done [18:39:43] thanks! 
[18:40:10] Urbanecm: looks good for me too [18:40:30] thanks, syncing [18:42:01] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/WikibaseMediaInfo/: f831d16e42e712832d683233a5b21ad59f7c73b3: Make the logistic regression image search default (T271799) (duration: 00m 58s) [18:42:09] cormacparle: done [18:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:10] T271799: [L] Implement new search profile(s) based on image search signal results - https://phabricator.wikimedia.org/T271799 [18:42:11] anyhting else? [18:42:13] (03CR) 10Ottomata: [C: 03+2] Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie [puppet] - 10https://gerrit.wikimedia.org/r/678941 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [18:43:11] I just got a 'gate pipeline build failed' email [18:43:19] is that something i need to pay attention to? [18:43:24] (03PS1) 10Legoktm: lists: Backup /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681763 [18:43:38] cormacparle: depends on which change that is [18:44:04] it probably means that one of your changes was +2'ed, but didn't pass tests [18:44:09] that can be flaky tests, or an actual bug [18:44:19] hard to tell w/o knowing the details [18:44:31] (03PS2) 10Legoktm: lists: Backup /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681763 [18:44:46] ah hang on, it's the patch onto master, not the patch for the branch [18:45:01] thought it might be what you just merged [18:45:04] cool [18:45:14] I'm going to send some test mails to root@ in lists1002 for T280744 don't be alarmed [18:45:15] T280744: Mail to root@lists1001.wikimedia.org doesn't work because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 [18:45:29] cormacparle: the patch i just deployed definitely got merged, so it must've passed gate-and-submit [18:45:35] great [18:46:29] !log Morning B&C done [18:46:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:39] (03CR) 10Legoktm: "Our worst case estimate was that the search index would be 212GB (T279701), but I doubt it'll be that big because attachments don't get in" [puppet] - 10https://gerrit.wikimedia.org/r/681763 (owner: 10Legoktm) [18:46:47] thank you Urbanecm ! [18:46:51] any time :) [18:49:06] Amir1: hello, do you know any good way to get Wikidata QID for a page outside of Wikibase code? [18:49:17] page props [18:49:40] yup [18:50:03] https://en.wikipedia.org/w/api.php?action=query&titles=Taylor%20Swift&prop=pageprops [18:50:14] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10JoKalliauer) p:05Low→03Triage @akosiaris : I removed the priority, also knowing you are an admin here, but I do... [18:50:27] thanks legoktm and Amir1! [18:52:06] Amir1: both test emails came through [18:52:21] sorry for spam [18:52:40] funnily enough, one was sent to lists1001. so it should be fixed now [18:52:51] maybe exim4 needed restart? [18:53:05] let me try again (sorry) [18:53:55] can't read exim4 logs in lists1001 :( [18:59:39] 10SRE, 10Mail: Mail to root@lists1001.wikimedia.org doesn't work because of /etc/aliases file permissions - https://phabricator.wikimedia.org/T280744 (10Ladsgroup) My test mails went through. Might be exim4 needed restart? Also /etc/aliases on lists1002 has much more options that lists1001: ` # HEADER: This f... 
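(Editor's note on the 18:49 exchange about getting a Wikidata QID via page props.) A minimal sketch of that lookup, assuming the standard Action API JSON layout, where the QID appears under the `wikibase_item` page prop. The helper names `build_pageprops_url` and `extract_qid` and the sample payload are hypothetical and used only for illustration; the endpoint and query parameters come from the URL pasted in the channel, with `format=json` added as an assumption.

```python
from urllib.parse import urlencode

# Endpoint taken from the link pasted in the channel.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_pageprops_url(title, api_url=API_URL):
    """Build a query URL asking for the page props of one title (hypothetical helper)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "pageprops",
        "format": "json",  # assumption: JSON output, default format version
    }
    return f"{api_url}?{urlencode(params)}"


def extract_qid(response):
    """Return the first 'wikibase_item' page prop found in a decoded response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    return None


# Hypothetical payload in the shape the API returns; the page id and QID are made up.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Example",
                "pageprops": {"wikibase_item": "Q1234567"},
            }
        }
    }
}

if __name__ == "__main__":
    print(build_pageprops_url("Taylor Swift"))
    print(extract_qid(SAMPLE_RESPONSE))
```

The sketch keeps URL building and response parsing separate from any HTTP client, so the parsing can be checked offline; fetching the URL is left to whatever client is already in use.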
[19:05:54] (03CR) 10Ladsgroup: [C: 03+1] lists: Backup /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681763 (owner: 10Legoktm) [19:07:04] that's the "adm" group thing again [19:07:16] ops also gets you adm which gets you the logs [19:07:47] but individual role admin without ops doesnt come with it [19:09:45] (03PS1) 10Dzahn: update fc-list to current version on buster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) [19:10:07] 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari, 10Patch-For-Review: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Aklapper) [19:12:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [19:15:22] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [19:28:18] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10JoKalliauer) [19:28:44] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10JoKalliauer) [19:32:26] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Dzahn) Meanwhile old ticket T79424 from 2011 has 
been reopened so I created https://gerrit.wikimedia.org/r/681766 [19:33:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: payments1006.frack.eqiad.wmnet DRAC no console output - https://phabricator.wikimedia.org/T280527 (10wiki_willy) a:03Cmjohnson Hi @Cmjohnson / @Jclark-ctr - this one is high priority [19:34:02] Will look at it shortly [19:40:13] 10SRE, 10serviceops: Find 8 machines (4 eqiad + 4 codfw) for Thumbor - https://phabricator.wikimedia.org/T280843 (10jijiki) p:05Triage→03Medium [19:44:16] (03CR) 10JoKalliauer: "Thanks that looks great. :-D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [19:45:56] !log manually kicking off a run of update-openstack-mirror on sodium to capture an upstream package update [19:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:05] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host planet1003.eqiad.wmnet [19:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:25] !log creating a ganeti VM to test bullseye install [19:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:34] (03CR) 10Dzahn: "So the file pathes in there are not a problem? Will amend to sort it alphabetically." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [19:48:39] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:51] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:52] (03PS2) 10Dzahn: update fc-list to current version on buster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) [19:51:05] (03CR) 10Dzahn: "> 2) Alphabetical ordering would be nice, as in https://commons.wikimedia.org/wiki/User:JoKalliauer/fc-list (import your text in your spre" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [19:52:46] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:59] (03CR) 10JoKalliauer: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [19:56:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:59:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host planet1003.eqiad.wmnet [19:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:46] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Majavah) p:05Triage→03Low @JoKalliauer: Please [[ https://www.mediawiki.org/wiki/Bug_management/Phabricator_eti... 
[20:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T2000). [20:03:05] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1016.eqiad.wmnet [20:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [20:08:50] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:10:43] (03PS1) 10Dzahn: DHCP: add planet1003 and use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/681774 [20:12:06] (03PS1) 10Ayounsi: Homer: get Capirca definitions from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/681775 (https://phabricator.wikimedia.org/T273865) [20:13:36] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681766" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [20:14:24] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681665" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681766 (https://phabricator.wikimedia.org/T79424) (owner: 10Dzahn) [20:15:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:16:12] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:18:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [20:18:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1016.eqiad.wmnet [20:18:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cloudcephosd1016.eqiad.wmnet` - clou... [20:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [20:20:31] (03PS1) 10Papaul: Add new backup node MAC address, partman recipe, role insetup [puppet] - 10https://gerrit.wikimedia.org/r/681777 (https://phabricator.wikimedia.org/T277323) [20:21:11] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcephosd1017.eqiad.wmnet [20:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:11] (03CR) 10Dzahn: [C: 03+2] DHCP: add planet1003 and use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/681774 (owner: 10Dzahn) [20:22:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:28:17] (03PS2) 10Papaul: Add new backup node MAC address, partman recipe, role insetup [puppet] - 10https://gerrit.wikimedia.org/r/681777 (https://phabricator.wikimedia.org/T277323) [20:31:54] (03CR) 10Papaul: [C: 03+2] Add 
new backup node MAC address, partman recipe, role insetup [puppet] - 10https://gerrit.wikimedia.org/r/681777 (https://phabricator.wikimedia.org/T277323) (owner: 10Papaul) [20:32:04] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcephosd1017.eqiad.wmnet [20:32:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cloudcephosd1017.eqiad.wmnet` - clou... [20:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:20] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:35:00] (03PS1) 10Dzahn: site: add planet1003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/681779 [20:35:35] (03PS2) 10Dzahn: site: add planet1003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/681779 [20:35:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) All new 10G cloud hosts need to be racked into C8 or D5 ONLY, so all of these hosts must be moved. 3 were racked into row... 
[20:36:50] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/681780 [20:38:24] (03CR) 10Dzahn: [C: 03+2] site: add planet1003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/681779 (owner: 10Dzahn) [20:43:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [20:44:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) @jcrespo backup2004 and 2007 are ready for OS install, you can take over [20:45:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) Chatted with John who pointed out D5's switch is nearly full. I logged in, and indeed it only has 4 ports left, so when 2... [20:46:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [21:06:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: payments1006.frack.eqiad.wmnet DRAC no console output - https://phabricator.wikimedia.org/T280527 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr Found power button was stuck. 
Fixed booting fine verified with jgreen [21:06:52] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jclark-ctr) [21:08:36] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10RhinosF1) [21:26:35] (03PS1) 10Ottomata: sanitize_eventlogging_analytics_immediate - ensure absent during switchover [puppet] - 10https://gerrit.wikimedia.org/r/681784 (https://phabricator.wikimedia.org/T280813) [21:26:50] (03CR) 10jerkins-bot: [V: 04-1] sanitize_eventlogging_analytics_immediate - ensure absent during switchover [puppet] - 10https://gerrit.wikimedia.org/r/681784 (https://phabricator.wikimedia.org/T280813) (owner: 10Ottomata) [21:27:00] (03PS1) 10Legoktm: lists: Fix mailman3 apache config [puppet] - 10https://gerrit.wikimedia.org/r/681785 (https://phabricator.wikimedia.org/T278612) [21:27:05] (03PS2) 10Ottomata: sanitize_eventlogging_analytics_immediate - ensure absent during switchover [puppet] - 10https://gerrit.wikimedia.org/r/681784 (https://phabricator.wikimedia.org/T280813) [21:28:46] (03CR) 10Ottomata: [C: 03+2] sanitize_eventlogging_analytics_immediate - ensure absent during switchover [puppet] - 10https://gerrit.wikimedia.org/r/681784 (https://phabricator.wikimedia.org/T280813) (owner: 10Ottomata) [21:34:10] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Arlolra) @Dzahn The static files aren't rendering, ex. https://parsoid-rt-tests.wikimedia.org/static/style.css `curl h... 
[21:46:31] (03PS1) 10Alexandros Kosiaris: Enable per flow ECMP for kubernetes/kubestage [homer/public] - 10https://gerrit.wikimedia.org/r/681789 (https://phabricator.wikimedia.org/T238909) [21:48:03] (03PS11) 10Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) [21:49:51] (03CR) 10Ryan Kemper: "Okay, that should be all the comments. The linter is probably still going to complain about a couple things, which I'll get to next monday" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [21:51:49] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor various rolling operations [cookbooks] - 10https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [22:04:26] (03PS1) 10Andrew Bogott: Trove.conf: add [network] section [puppet] - 10https://gerrit.wikimedia.org/r/681790 (https://phabricator.wikimedia.org/T212595) [22:05:43] (03CR) 10Andrew Bogott: [C: 03+2] Trove.conf: add [network] section [puppet] - 10https://gerrit.wikimedia.org/r/681790 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [22:10:44] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Legoktm) >>! In T278612#7011274, @Ladsgroup wrote: > So after several changes in puppetmaster of mailman in the cloud, it works now: https://polymor... [22:23:15] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Glrx) conf/fc-list has useful information, and there have been several requests for updated information on MW fonts... 
[22:32:01] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) [22:32:31] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) [22:37:36] 10SRE, 10Services, 10serviceops, 10Service-deployment-requests: New Service Request - Calculator Service - https://phabricator.wikimedia.org/T273807 (10bd808) [22:38:50] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Improve workflow for mailman database bootstrapping and updates - https://phabricator.wikimedia.org/T278499 (10Legoktm) 05Open→03Resolved a:03Legoktm Sounds good, thanks for the input! [22:41:11] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10AntiCompositeNumber) From a user perspective, having something like fc-list is useful. {https://phabricator.wikimed... [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210421T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:03:27] (03PS1) 10Bstorm: cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) [23:04:37] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) [23:04:43] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:04:44] 10SRE, 10DBA, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Legoktm) It slipped my mind that we need to test the new packages first, I filed {T280887} for that. If we can get that upgrade... [23:04:47] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:05:18] (03PS2) 10Bstorm: cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) [23:06:34] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:08:12] (03PS3) 10Bstorm: cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) [23:09:27] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [23:11:21] (03PS4) 10Bstorm: cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) [23:17:58] 10SRE, 10Wikimedia-Mailing-lists: Expose mailman3 internal REST 
API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (10Legoktm) [23:24:19] 10SRE, 10Wikimedia-Mailing-lists: Expose mailman3 internal REST API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (10Legoktm) Since we already need envoy in front of the REST API to provide HTTPS, I think we can have it also do some limited access control, or only expose the s... [23:27:06] (03CR) 10Jforrester: "check experimental" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [23:49:55] !log made myself and Amir1 list admins for the listadmins@lists.wikimedia.org mailing list [23:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log