[00:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T0000).
[00:00:53] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:09] ^ the lists1001 alert is flappy unfortunately
[00:08:13] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:18] (CR) Legoktm: [C: +2] "I think we can create a dedicated alias as a follow-up." [puppet] - https://gerrit.wikimedia.org/r/685567 (owner: Ladsgroup)
[00:23:05] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 374043624 and 87 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:27:59] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Rename mailinglists eliso, and eliso-anoncoj - https://phabricator.wikimedia.org/T281686 (Ladsgroup) Open→Resolved a:Ladsgroup It's done now both are renamed and upgraded to mm3. Please check if everything is fine. - Create an account: https...
[00:29:54] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup) So public mailing lists in group B and C are now upgraded to mm3, it was less messy than the first upgrade. I also migrated some other mailing lists for various reason...
[00:30:31] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:30:52] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup) I forgot to give some numbers: Now more than 1/4th of all mailing lists are on mm3 \o/
[00:32:47] (Primary outbound port utilisation over 80% #page) firing: (2) Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[00:33:40] * legoktm is driving
[00:35:04] !log sudo service mailman3-web restart
[00:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:47] (Primary outbound port utilisation over 80% #page) resolved: (2) Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[00:48:54] SRE, SRE-Access-Requests: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (Reedy)
[01:33:43] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:34:41] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:41] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikibase_repo_prune_test.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:53] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:56:13] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:57:11] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:48] SRE, SRE-Access-Requests: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (Dzahn) a:Dzahn
[02:20:46] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (Dzahn) T282068 is another duplicate of this, right?
[02:22:22] SRE, ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T282068 (Dzahn) p:Triage→Medium
[02:26:33] PROBLEM - MegaRAID on db2107 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:26:34] ACKNOWLEDGEMENT - MegaRAID on db2107 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T282072 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:26:39] SRE, ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (ops-monitoring-bot)
[02:37:43] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[02:55:54] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
[02:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:03] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[02:56:17] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
[02:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:59] !log [Elastic] It looks like we've got a single missing shard in `production-search-codfw` (port 9200), which is putting the cluster into red status. The cluster won't get back into green status without intervention
[03:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:50] !log [Elastic] I banned two nodes simultaneously earlier today - if there's an index with only 1 replica, and its primary and replica happened to be on the two nodes I banned, then that would have caused this situation
[03:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:07:39] (CR) Cwhite: "Overall LGTM. It may be worthwhile to service ensure => stopped on the collector ES instances. Not sure how hard that would be to set up" [puppet] - https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: Herron)
[03:08:05] !log [Elastic] Temporarily unbanning `elastic2033` and `elastic2043` from `production-search-codfw` to see if we can get the cluster green again. If it returns to green then we'll ban one node, wait for the shards to redistribute, and then ban the other
[03:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:54] !log [Elastic] `ryankemper@elastic2044:~$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": null,"_name": null}}}'`
[03:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:24] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul)
[03:14:08] (CR) Cwhite: [C: +1] kafka-logging: migrate logstash2003 broker to kafka-logging2003 [puppet] - https://gerrit.wikimedia.org/r/683014 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[03:15:42] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper) This host is ssh unreachable again. There is definitely some underlying hardware failure.
[03:16:14] (CR) Cwhite: [C: +1] kafka-logging: migrate logstash2002 broker to kafka-logging2002 [puppet] - https://gerrit.wikimedia.org/r/683013 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[03:16:41] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (Papaul) I am still working on it
[03:16:51] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:16:55] (CR) Cwhite: [C: +1] kafka-logging: migrate logstash2001 broker to kafka-logging2001 [puppet] - https://gerrit.wikimedia.org/r/683012 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[03:17:22] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul) @jcrespo no IP change just switch port change
[03:18:23] !log [Elastic] `elastic2043` is ssh unreachable. Power cycling it to bring it briefly back online - if it has the shard it should be able to repair the cluster state. Otherwise I'll have to delete the index for `enwiki_titlesuggest_1620184482` given the data would be unrecoverable
[03:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:32] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul)
[03:24:32] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper) ack!
[03:31:39] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.094 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:38:45] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2004.codfw.wmnet` on `ryankemper@cumin1001` tmux session `reimage`
[03:38:52] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1007.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage`
[03:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:54] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[03:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:05] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1007.eqiad.wmnet with reason: REIMAGE
[03:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:09] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2004.codfw.wmnet with reason: REIMAGE
[03:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1007.eqiad.wmnet with reason: REIMAGE
[03:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2004.codfw.wmnet with reason: REIMAGE
[03:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:44] (PS1) Tim Starling: Remove harmful validation regex in PageReferenceValue [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685596 (https://phabricator.wikimedia.org/T282070)
[04:08:55] (CR) Tim Starling: [C: +2] Remove harmful validation regex in PageReferenceValue [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685596 (https://phabricator.wikimedia.org/T282070) (owner: Tim Starling)
[04:32:54] (Merged) jenkins-bot: Remove harmful validation regex in PageReferenceValue [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685596 (https://phabricator.wikimedia.org/T282070) (owner: Tim Starling)
[04:44:52] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (jijiki)
[04:52:36] SRE, ops-codfw, DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Marostegui) p:Triage→Medium a:Papaul @papaul this host is under support, can we get a new disk from DELL? This is s2 codfw master
[05:13:55] (PS1) ArielGlenn: Remove snapshot1005,6,7 from mediawiki scap targets [puppet] - https://gerrit.wikimedia.org/r/685636 (https://phabricator.wikimedia.org/T281330)
[05:18:07] (CR) ArielGlenn: [C: +2] Remove snapshot1005,6,7 from mediawiki scap targets [puppet] - https://gerrit.wikimedia.org/r/685636 (https://phabricator.wikimedia.org/T281330) (owner: ArielGlenn)
[05:20:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_internal_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:44] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:45] (PS1) ArielGlenn: remove snapshot1005,6,7 from dump scap targets [dumps/scap] - https://gerrit.wikimedia.org/r/685638 (https://phabricator.wikimedia.org/T281330)
[05:21:14] (CR) ArielGlenn: [V: +2 C: +2] remove snapshot1005,6,7 from dump scap targets [dumps/scap] - https://gerrit.wikimedia.org/r/685638 (https://phabricator.wikimedia.org/T281330) (owner: ArielGlenn)
[05:22:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:26:52] jouncebot: now
[05:26:53] No deployments scheduled for the next 4 hour(s) and 33 minute(s)
[05:27:28] (PS1) Marostegui: pc2010: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/685639
[05:27:48] !log upgrade scap to 3.17.1-1 - T279695
[05:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:57] T279695: Deploy Scap version 3.17.1-1 - https://phabricator.wikimedia.org/T279695
[05:28:07] (CR) Marostegui: [C: +2] pc2010: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/685639 (owner: Marostegui)
[05:32:16] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.4/includes/page/PageReferenceValue.php: fixing T282070 RC/log breakage due to unblocking autoblocks (duration: 01m 09s)
[05:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:25] T282070: After unblocking autoblock, Special:Log and Special:RecentChanges gives ParameterAssertionException: Bad value for parameter $dbKey - https://phabricator.wikimedia.org/T282070
[05:33:27] (PS1) ArielGlenn: turn snapshot1005,6,7 into spares [puppet] - https://gerrit.wikimedia.org/r/685640 (https://phabricator.wikimedia.org/T282078)
[05:35:00] (CR) ArielGlenn: [C: +2] turn snapshot1005,6,7 into spares [puppet] - https://gerrit.wikimedia.org/r/685640 (https://phabricator.wikimedia.org/T282078) (owner: ArielGlenn)
[05:37:30] (PS1) Marostegui: db1158: Binlog format: ROW [puppet] - https://gerrit.wikimedia.org/r/685641
[05:37:36] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[05:37:42] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`
[05:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:51] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[05:37:59] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1007.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`
[05:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 and db1158 to switch sanitarium masters', diff saved to https://phabricator.wikimedia.org/P15792 and previous config saved to /var/cache/conftool/dbconfig/20210506-053801-marostegui.json
[05:38:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[05:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:20] (CR) Marostegui: [C: +2] db1158: Binlog format: ROW [puppet] - https://gerrit.wikimedia.org/r/685641 (owner: Marostegui)
[05:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:36] PROBLEM - mediawiki-installation DSH group on snapshot1006 is CRITICAL: Host snapshot1006 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[05:43:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[05:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[05:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15793 and previous config saved to /var/cache/conftool/dbconfig/20210506-054404-root.json
[05:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 25%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15794 and previous config saved to /var/cache/conftool/dbconfig/20210506-054419-root.json
[05:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:42] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:45:54] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.077 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:46:10] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:46:41] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:46:56] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:47:23] (PS1) ArielGlenn: remove snapshot1005,6,7 from dumps nfs mounters list [puppet] - https://gerrit.wikimedia.org/r/685642
[05:47:56] (CR) jerkins-bot: [V: -1] remove snapshot1005,6,7 from dumps nfs mounters list [puppet] - https://gerrit.wikimedia.org/r/685642 (owner: ArielGlenn)
[05:48:44] (PS2) ArielGlenn: remove snapshot1005,6,7 from dumps nfs mounters list [puppet] - https://gerrit.wikimedia.org/r/685642 (https://phabricator.wikimedia.org/T282078)
[05:51:39] (PS1) Samwilson: Enable Wikimedia OCR on Beta Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/685643 (https://phabricator.wikimedia.org/T282080)
[05:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112 after checking its tables', diff saved to https://phabricator.wikimedia.org/P15795 and previous config saved to /var/cache/conftool/dbconfig/20210506-055509-root.json
[05:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1083 T281445', diff saved to https://phabricator.wikimedia.org/P15796 and previous config saved to /var/cache/conftool/dbconfig/20210506-055535-marostegui.json
[05:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:44] T281445: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445
[05:56:10] (PS1) Marostegui: db1083: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/685644 (https://phabricator.wikimedia.org/T281445)
[05:57:04] (CR) Marostegui: [C: +2] db1083: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/685644 (https://phabricator.wikimedia.org/T281445) (owner: Marostegui)
[05:59:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15797 and previous config saved to /var/cache/conftool/dbconfig/20210506-055907-root.json
[05:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 50%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15798 and previous config saved to /var/cache/conftool/dbconfig/20210506-055923-root.json
[05:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:59] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[06:00:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[06:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:07] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`
[06:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:19] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[06:00:24] ^ bad copy paste, deleting that line from SAL
[06:00:29] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage`
[06:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:01] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1007.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage`
[06:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:42] ryankemper: o/ elastic2043 is down again :(
[06:01:55] elukey: yeah it's being worked on by papaul
[06:02:15] I briefly unbanned it from the network because it might have a shard that we need to not lose data :x
[06:02:28] (1 shard's worth of data to be clear)
[06:02:40] ryankemper: ah ok so I can ack with T281327
[06:02:41] T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327
[06:03:06] elukey: yeah please do
[06:03:15] I thought I acked it earlier, not sure if it flapped or if I just never pressed submit haha
[06:03:24] np!
[06:03:28] SRE, Traffic, HTTPS, Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (RuleTheWiki) [[ https://hacks.mozilla.org/2021/04/quic-and-http-3-support-now-in-firefox-nightly-and-beta/ | Mozilla now supports HTTP/3 ]] and the editor'...
[06:03:32] SRE, Wikimedia-Mailing-lists: Error in qcluster - https://phabricator.wikimedia.org/T282071 (Legoktm) Looks like we need to cherry-pick https://gitlab.com/mailman/hyperkitty/-/commit/2712722da7608c42e54ac73a392edb8673de9c4f ?
[06:04:00] ACKNOWLEDGEMENT - SSH on elastic2043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Elukey T281327 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:04:00] ACKNOWLEDGEMENT - Elasticsearch HTTPS for production-search-psi-codfw on elastic2043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out Elukey T281327 https://wikitech.wikimedia.org/wiki/Search
[06:04:00] ACKNOWLEDGEMENT - Elasticsearch HTTPS for production-search-codfw on elastic2043 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out Elukey T281327 https://wikitech.wikimedia.org/wiki/Search
[06:04:00] ACKNOWLEDGEMENT - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T281327
[06:05:13] ryankemper: since we are here (if you have a moment) - elastic2033 is meant to have puppet disabled + some prometheus units down etc.?
[06:09:42] ah of course https://phabricator.wikimedia.org/T281621
[06:10:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repool db1112 after checking its tables', diff saved to https://phabricator.wikimedia.org/P15799 and previous config saved to /var/cache/conftool/dbconfig/20210506-061012-root.json
[06:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:28] SRE, ops-codfw, Discovery, Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey) @Papaul what did you do to fix it?? (curious) Thanks!
[06:12:12] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:15] ok so restarted the units
[06:12:21] and it worked (mostly prometheus)
[06:13:56] SRE, ops-codfw, Discovery, Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey) @RKemper I restarted the failed prometheus units on the node to clear icinga, but puppet is still disable, can you enable it when you hav...
[06:14:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15800 and previous config saved to /var/cache/conftool/dbconfig/20210506-061411-root.json
[06:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 75%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15801 and previous config saved to /var/cache/conftool/dbconfig/20210506-061427-root.json
[06:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:25] ACKNOWLEDGEMENT - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service Legoktm T280744 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:32] !log apt-get clean on ping[1,2,3]001 to free some space
[06:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:23:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:24:36] (CR) ArielGlenn: [C: +2] remove snapshot1005,6,7 from dumps nfs mounters list [puppet] - https://gerrit.wikimedia.org/r/685642 (https://phabricator.wikimedia.org/T282078) (owner: ArielGlenn)
[06:25:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112 after checking its tables', diff saved to https://phabricator.wikimedia.org/P15802 and previous config saved to /var/cache/conftool/dbconfig/20210506-062516-root.json
[06:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repool db1158', diff saved to https://phabricator.wikimedia.org/P15803 and previous config saved to /var/cache/conftool/dbconfig/20210506-062915-root.json
[06:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15804 and previous config saved to /var/cache/conftool/dbconfig/20210506-062931-root.json
[06:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:56] RECOVERY - Keyholder SSH agent on cumin2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[06:32:15] elukey: thanks, kicked off a puppet run
[06:33:09] perfect thanks
[06:40:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112 after checking its tables', diff saved to https://phabricator.wikimedia.org/P15805 and previous config saved to /var/cache/conftool/dbconfig/20210506-064020-root.json
[06:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:33] ACKNOWLEDGEMENT - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service ArielGlenn Theres a task, T265056, the search team has it on their todo list https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:10] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:05:27] (PS2) Muehlenhoff: Add ldap-replica2005 as new replica with Bullseye [puppet] - https://gerrit.wikimedia.org/r/683290
[07:07:52] (PS1) Legoktm: lists: Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711
[07:08:20] (CR) jerkins-bot: [V: -1] lists: Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711 (owner: Legoktm)
[07:09:25] (PS2) Legoktm: lists: Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711
[07:22:20] (CR) Vgutierrez: [C: +2] trafficserver: Fix cacert_(dirpath|filename) usage [puppet] - https://gerrit.wikimedia.org/r/685503 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[07:22:36] RECOVERY - WDQS high update lag on wdqs2001 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.159e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:24:19] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[07:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:36] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.083 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:24:41] !log installing exim security updates on bullseye hosts
[07:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:50] PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 4800 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:26:33] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[07:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:58] (CR) Vgutierrez: [C: +2] trafficserver: Clear outbound TLS cacert_path for cp4026 and cp4032 [puppet] - https://gerrit.wikimedia.org/r/685504 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[07:27:21] (PS1) Majavah: toolforge: Add ingress-nginx Helm files [puppet] - https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221)
[07:27:22] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:27:26] SRE, DBA, Orchestrator: Base replication lag detection on heartbeat - https://phabricator.wikimedia.org/T268316 (Marostegui)
[07:27:36] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 4986 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:29:19] !log Enforce Puppet Internal CA validation on trafficserver@cp[4026,4032] - T281673
[07:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:10] (CR) Aklapper: "Right. Thanks everyone!" [puppet] - https://gerrit.wikimedia.org/r/682124 (owner: Aklapper)
[07:36:05] (PS1) Jcrespo: dbbackups: remove db2098 s3 section for this codfw backup source [puppet] - https://gerrit.wikimedia.org/r/685717 (https://phabricator.wikimedia.org/T280492)
[07:37:25] (CR) Jcrespo: "I will update tendril and zarcillo when done." [puppet] - https://gerrit.wikimedia.org/r/685717 (https://phabricator.wikimedia.org/T280492) (owner: Jcrespo)
[07:37:38] (CR) Muehlenhoff: [C: +2] Add ldap-replica2005 as new replica with Bullseye [puppet] - https://gerrit.wikimedia.org/r/683290 (owner: Muehlenhoff)
[07:39:36] (CR) Marostegui: [C: +1] dbbackups: remove db2098 s3 section for this codfw backup source [puppet] - https://gerrit.wikimedia.org/r/685717 (https://phabricator.wikimedia.org/T280492) (owner: Jcrespo)
[07:40:26] (PS1) Muehlenhoff: Allow new LDAP replicas to access acmechief [puppet] - https://gerrit.wikimedia.org/r/685718
[07:40:46] (PS2) Muehlenhoff: Allow new LDAP replicas to access acmechief [puppet] - https://gerrit.wikimedia.org/r/685718
[07:41:12] (PS3) Awight: Enable ReferencePreviews as full default on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[07:45:25] !log ariel@cumin1001 START - Cookbook sre.hosts.decommission for hosts snapshot1005.eqiad.wmnet
[07:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1160 for schema change', diff saved to https://phabricator.wikimedia.org/P15806 and previous config saved to /var/cache/conftool/dbconfig/20210506-074746-marostegui.json
[07:47:51] !log shutting down and removing db2098:s3 instance
[07:47:53]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:21] (03CR) 10Jcrespo: [C: 03+2] dbbackups: remove db2098 s3 section for this codfw backup source [puppet] - 10https://gerrit.wikimedia.org/r/685717 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [07:48:28] (03PS2) 10Jcrespo: dbbackups: remove db2098 s3 section for this codfw backup source [puppet] - 10https://gerrit.wikimedia.org/r/685717 (https://phabricator.wikimedia.org/T280492) [07:50:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [07:53:04] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Migrate node_puppet_agent cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685581 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [07:53:34] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Migrate node_gdnsd cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [07:53:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P15807 and previous config saved to /var/cache/conftool/dbconfig/20210506-075359-root.json [07:54:00] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Migrate node_file_count cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685583 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [07:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P15808 and previous config saved to /var/cache/conftool/dbconfig/20210506-075416-marostegui.json [07:54:22] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:18] (03CR) 10Vgutierrez: [C: 03+1] "nitpick: missing task on commit message" [puppet] - 10https://gerrit.wikimedia.org/r/685718 (owner: 10Muehlenhoff) [08:00:51] (03CR) 10WMDE-Fisch: [C: 03+1] "Thanks for the clean-up :-)!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [08:04:22] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts snapshot1005.eqiad.wmnet [08:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:30] !log ariel@cumin1001 START - Cookbook sre.hosts.decommission for hosts snapshot1006.eqiad.wmnet [08:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:30] (03PS3) 10Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [08:07:32] (03PS3) 10Giuseppe Lavagetto: eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 [08:07:34] (03PS1) 10Giuseppe Lavagetto: Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 [08:07:54] (03CR) 10jerkins-bot: [V: 04-1] Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 (owner: 10Giuseppe Lavagetto) [08:08:39] (03CR) 10jerkins-bot: [V: 04-1] eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [08:08:41] (03CR) 10jerkins-bot: [V: 04-1] eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (owner: 10Giuseppe Lavagetto) [08:09:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repool db1160', diff saved to 
https://phabricator.wikimedia.org/P15809 and previous config saved to /var/cache/conftool/dbconfig/20210506-080902-root.json [08:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:35] RECOVERY - dump of es4 in eqiad on alert1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2021-05-05 19:10:52 (1746 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:10:49] RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 258.9 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:11:49] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3600 ge (W)1200 ge 1030 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:13:16] (03CR) 10Muehlenhoff: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/685718 (owner: 10Muehlenhoff) [08:13:23] (03CR) 10Muehlenhoff: [C: 03+2] Allow new LDAP replicas to access acmechief [puppet] - 10https://gerrit.wikimedia.org/r/685718 (owner: 10Muehlenhoff) [08:16:28] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts snapshot1006.eqiad.wmnet [08:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:16] (03PS1) 10Legoktm: mailman3: Script to generate pipermail redirects [puppet] - 10https://gerrit.wikimedia.org/r/685723 (https://phabricator.wikimedia.org/T280731) [08:17:50] (03CR) 10jerkins-bot: [V: 04-1] mailman3: Script to generate pipermail redirects [puppet] - 10https://gerrit.wikimedia.org/r/685723 (https://phabricator.wikimedia.org/T280731) (owner: 10Legoktm) [08:18:23] !log ariel@cumin1001 START - Cookbook sre.hosts.decommission for hosts snapshot1007.eqiad.wmnet [08:18:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:50] (03PS2) 10Legoktm: mailman3: Script to generate pipermail redirects [puppet] - 10https://gerrit.wikimedia.org/r/685723 (https://phabricator.wikimedia.org/T280731) [08:21:59] (03CR) 10Awight: Enable ReferencePreviews as full default on pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [08:23:19] !log imported wikimedia-lvs-realserver to apt.wikimedia.org/bullseye T275873 [08:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:28] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [08:24:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P15810 and previous config saved to /var/cache/conftool/dbconfig/20210506-082406-root.json [08:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:10] 10SRE, 10Mail, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (10Legoktm) 05Open→03Resolved a:03Ladsgroup Aaaand now it's going to listadmins-owner@ because of {9b8147775d1e468a1b8578004aff645da9d153eb}?... 
[08:27:25] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts snapshot1007.eqiad.wmnet [08:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P15811 and previous config saved to /var/cache/conftool/dbconfig/20210506-083307-root.json [08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:18] (03PS1) 10Muehlenhoff: openldap: Remove python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/685725 [08:37:36] (03PS2) 10Muehlenhoff: openldap: Remove python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/685725 [08:39:09] (03PS1) 10Jcrespo: dbbackups: remove db2097 s6 section for this codfw backup source [puppet] - 10https://gerrit.wikimedia.org/r/685726 (https://phabricator.wikimedia.org/T280751) [08:39:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P15812 and previous config saved to /var/cache/conftool/dbconfig/20210506-083910-root.json [08:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:03] (03CR) 10Jcrespo: "this is preparing the cleanup, but will wait to make sure everything looks fine for some days before actually removing it (it is now passi" [puppet] - 10https://gerrit.wikimedia.org/r/685726 (https://phabricator.wikimedia.org/T280751) (owner: 10Jcrespo) [08:42:19] (03PS1) 10ArielGlenn: remove fake mcrouter secrets for snapshot1005,6,7 [labs/private] - 10https://gerrit.wikimedia.org/r/685727 (https://phabricator.wikimedia.org/T282078) [08:43:00] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Makes so much more sense with these comments. ;-) Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [08:43:17] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] remove fake mcrouter secrets for snapshot1005,6,7 [labs/private] - 10https://gerrit.wikimedia.org/r/685727 (https://phabricator.wikimedia.org/T282078) (owner: 10ArielGlenn) [08:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 and db1167 to switch sanitarium masters', diff saved to https://phabricator.wikimedia.org/P15813 and previous config saved to /var/cache/conftool/dbconfig/20210506-084443-marostegui.json [08:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:42] 10SRE, 10Wikimedia-Mailing-lists: Upload new mailman3 and hyperkitty packages - https://phabricator.wikimedia.org/T282092 (10Legoktm) [08:47:21] (03PS1) 10ArielGlenn: remove last traces of snapshot1005,6,7 from puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/685728 (https://phabricator.wikimedia.org/T282078) [08:47:42] (03CR) 10jerkins-bot: [V: 04-1] remove last traces of snapshot1005,6,7 from puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/685728 (https://phabricator.wikimedia.org/T282078) (owner: 10ArielGlenn) [08:47:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 db1167', diff saved to https://phabricator.wikimedia.org/P15814 and previous config saved to /var/cache/conftool/dbconfig/20210506-084754-marostegui.json [08:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P15815 and previous config saved to /var/cache/conftool/dbconfig/20210506-084811-root.json [08:48:17] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman3 "New subscription request to" template line wraps, breaking long links - 
https://phabricator.wikimedia.org/T282044 (10Legoktm) Apparently if the URL is on a line that starts with whitespace, it won't get wrapped. I didn't test this though. [08:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:07] 10SRE, 10Wikimedia-Mailing-lists: Make customized Mailman3 templates translatable - https://phabricator.wikimedia.org/T282018 (10Legoktm) [08:49:10] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Legoktm) [08:49:48] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [08:49:57] (03CR) 10ArielGlenn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/685728 (https://phabricator.wikimedia.org/T282078) (owner: 10ArielGlenn) [08:50:25] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [08:52:51] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [08:53:05] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [08:53:52] (03CR) 10ArielGlenn: [C: 03+2] remove last traces of snapshot1005,6,7 from puppet manifests [puppet] - 10https://gerrit.wikimedia.org/r/685728 (https://phabricator.wikimedia.org/T282078) (owner: 10ArielGlenn) [08:54:11] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) 05Open→03Resolved All hosts that are scheduled for decommissioning are now ready (but waiting a few days to make 
sure their repl... [08:58:35] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5088163858840 and 6250401 seconds Hnowlan Needs to be resynced. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:03:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: wmcs-dns-floating-ip-updater.py: fix typo in config option [puppet] - 10https://gerrit.wikimedia.org/r/685488 (owner: 10Arturo Borrero Gonzalez) [09:03:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P15816 and previous config saved to /var/cache/conftool/dbconfig/20210506-090315-root.json [09:03:20] !log sudo apt-get remove linux-image-4.19.0-11-amd64 linux-image-4.19.0-9-amd64 linux-image-4.19.0-13-amd64 on ping[123]001 host to free some space (tiny root partition, these are old kernels) [09:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:38] XioNoX --^ [09:06:27] (03PS2) 10Arturo Borrero Gonzalez: openstack: wmcs-dns-floating-ip-updater.py: run black [puppet] - 10https://gerrit.wikimedia.org/r/685491 [09:07:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: wmcs-dns-floating-ip-updater.py: run black [puppet] - 10https://gerrit.wikimedia.org/r/685491 (owner: 10Arturo Borrero Gonzalez) [09:18:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Repool db1096:3315', diff saved to https://phabricator.wikimedia.org/P15817 and previous config saved to /var/cache/conftool/dbconfig/20210506-091818-root.json [09:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110 for schema change', diff saved to https://phabricator.wikimedia.org/P15818 and previous 
config saved to /var/cache/conftool/dbconfig/20210506-092217-marostegui.json [09:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 69 probes of 633 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:28:04] (03Abandoned) 10Filippo Giunchedi: pontoon: enable sso for alerts in cloud [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676386 (owner: 10Filippo Giunchedi) [09:28:18] (03Abandoned) 10Filippo Giunchedi: pontoon: use public_domain for alerts/icinga [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676387 (owner: 10Filippo Giunchedi) [09:28:37] apologies in advance, there will be a little gerrit spam [09:28:46] (03Abandoned) 10Filippo Giunchedi: pontoon: introduce public_certs [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676388 (owner: 10Filippo Giunchedi) [09:28:56] (03Abandoned) 10Filippo Giunchedi: pontoon: add public LB class [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676389 (owner: 10Filippo Giunchedi) [09:29:06] (03Abandoned) 10Filippo Giunchedi: role: add pontoon::frontend role/profile [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676390 (owner: 10Filippo Giunchedi) [09:29:08] (03CR) 10Mvolz: [C: 03+2] Switch to a different contact email [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [09:29:21] (03Abandoned) 10Filippo Giunchedi: wmflib: add role/public_endpoint to wmflib::service [puppet] (sandbox/filippo/pontoon-o11y) - 10https://gerrit.wikimedia.org/r/676385 (owner: 10Filippo Giunchedi) [09:29:49] [cross-posting] FYI I'll be upgrading spicerack on cumin2001 in few minutes, 
for critical cookbooks runs please use cumin1001 for the next hour or so. [09:30:34] (03Merged) 10jenkins-bot: Switch to a different contact email [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [09:31:10] (03PS1) 10Filippo Giunchedi: wmflib: add role/public_endpoint to wmflib::service [puppet] - 10https://gerrit.wikimedia.org/r/685734 [09:31:12] (03PS1) 10Filippo Giunchedi: pontoon: enable sso for alerts in cloud [puppet] - 10https://gerrit.wikimedia.org/r/685735 [09:31:14] (03PS1) 10Filippo Giunchedi: pontoon: use public_domain for alerts/icinga [puppet] - 10https://gerrit.wikimedia.org/r/685736 [09:31:16] (03PS1) 10Filippo Giunchedi: pontoon: introduce public_certs [puppet] - 10https://gerrit.wikimedia.org/r/685737 [09:31:18] (03PS1) 10Filippo Giunchedi: pontoon: add public LB class [puppet] - 10https://gerrit.wikimedia.org/r/685738 [09:31:20] (03PS1) 10Filippo Giunchedi: role: add pontoon::frontend role/profile [puppet] - 10https://gerrit.wikimedia.org/r/685739 [09:31:22] (03PS1) 10Alexandros Kosiaris: base: Remove the jessie if clause, move packages to array [puppet] - 10https://gerrit.wikimedia.org/r/685740 [09:31:32] (03PS1) 10Muehlenhoff: conftool::client: Remove python-socks [puppet] - 10https://gerrit.wikimedia.org/r/685741 [09:32:14] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 633 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:32:15] (03CR) 10Muehlenhoff: "Already done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/671095, but merging is blocked by the two remaining mwlog servers" [puppet] - 10https://gerrit.wikimedia.org/r/685740 (owner: 10Alexandros Kosiaris) [09:33:15] (03PS2) 10MSantos: maps imposm3: add log file for imposm3 sync [puppet] - 10https://gerrit.wikimedia.org/r/670817 
[09:33:17] (03PS1) 10MSantos: WIP: maps: DB performance improvements [puppet] - 10https://gerrit.wikimedia.org/r/685743 [09:35:33] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Bump limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/685745 (https://phabricator.wikimedia.org/T279411) [09:36:56] (03PS2) 10Filippo Giunchedi: pontoon: enable sso for alerts in cloud [puppet] - 10https://gerrit.wikimedia.org/r/685735 [09:37:57] (03CR) 10Jbond: "LGTM, however being picky i think we could use absolute paths and drop all the cd, pushd and popd commands. Also this just deploys a scri" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [09:38:03] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: enable sso for alerts in cloud [puppet] - 10https://gerrit.wikimedia.org/r/685735 (owner: 10Filippo Giunchedi) [09:38:34] (03PS2) 10Filippo Giunchedi: pontoon: use public_domain for alerts/icinga [puppet] - 10https://gerrit.wikimedia.org/r/685736 [09:39:52] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use public_domain for alerts/icinga [puppet] - 10https://gerrit.wikimedia.org/r/685736 (owner: 10Filippo Giunchedi) [09:40:46] (03PS2) 10Alexandros Kosiaris: linkrecommendation: Bump limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/685745 (https://phabricator.wikimedia.org/T279411) [09:42:11] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685571 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [09:42:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Bump limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/685745 (https://phabricator.wikimedia.org/T279411) (owner: 10Alexandros Kosiaris) [09:44:20] (03Merged) 10jenkins-bot: linkrecommendation: Bump limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/685745 (https://phabricator.wikimedia.org/T279411) (owner: 10Alexandros 
Kosiaris) [09:45:31] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [09:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:13] (03PS1) 10Hnowlan: Remove references to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) [09:50:20] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [09:50:20] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [09:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker: Stop copying config for each Debian version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683979 (owner: 10Legoktm) [09:55:01] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [09:55:02] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [09:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:45] (03CR) 10Hnowlan: maps imposm3: add log file for imposm3 sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670817 (owner: 10MSantos) [09:59:23] (03PS1) 10Mvolz: Update Zotero to use new email for crossRef [deployment-charts] - 10https://gerrit.wikimedia.org/r/685747 (https://phabricator.wikimedia.org/T278516) [10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1000). 
[10:03:23] (03CR) 10Volans: "> Patch Set 3:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [10:10:04] Trying to deploy and I keep getting: ssh: Could not resolve hostname bast1002.wikimedia.org: Name or service not known - I assume this probably an issue with my DNS? [10:10:12] (03PS1) 10Effie Mouzeli: WIP: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 [10:10:29] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (owner: 10Effie Mouzeli) [10:12:32] oh nvm apparently we're on 1003 now :) [10:13:17] (03CR) 10Elukey: "LGTM, but are the netboot/dhcp configs already taken care by something else? If so feel free to proceed :)" [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) (owner: 10Hnowlan) [10:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15819 and previous config saved to /var/cache/conftool/dbconfig/20210506-101339-root.json [10:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:03] (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [10:14:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [10:14:17] (03CR) 10Hnowlan: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) (owner: 10Hnowlan) [10:18:09] (03PS2) 10Hnowlan: Remove references to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) 
[10:19:36] !log stop dbprov2002 in advance of maintenance T281135 [10:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:44] T281135: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 [10:21:34] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) @Papaul could you turn dbprov2002 back on when you finish all needed maintenance? That's all it will need to be back into service. Thank you. [10:21:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/685571 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:28:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15820 and previous config saved to /var/cache/conftool/dbconfig/20210506-102842-root.json [10:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:42] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/685515 (owner: 10Ssingh) [10:32:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, but don't forget to also bump the chart version in Chart.yaml for this to be picked up by deployments." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [10:33:39] (03CR) 10Mvolz: "> Patch Set 5: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683563 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [10:33:41] (03CR) 10Muehlenhoff: [C: 03+2] openldap/offboard-user.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [10:37:24] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) [10:37:31] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) [10:37:33] 10Puppet, 10SRE, 10SRE-tools, 10Patch-For-Review, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) [10:39:13] (03PS1) 10Mvolz: Bump chart version to use new crossref e-mail [deployment-charts] - 10https://gerrit.wikimedia.org/r/685751 (https://phabricator.wikimedia.org/T278516) [10:39:19] (03PS2) 10Mvolz: Bump chart version to use new crossref e-mail [deployment-charts] - 10https://gerrit.wikimedia.org/r/685751 (https://phabricator.wikimedia.org/T278516) [10:42:24] (03CR) 10ArielGlenn: "nit: typo in commit message (Seach -> Search)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [10:43:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15821 and previous config saved to /var/cache/conftool/dbconfig/20210506-104346-root.json [10:43:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:06] (03CR) 10ArielGlenn: Enable Extension:MediaSearch on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682105 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [10:48:40] (03PS2) 10Matthias Mullie: Enable Extension:MediaSeach on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) [10:49:35] (03PS3) 10Matthias Mullie: Enable Extension:MediaSeach on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) [10:49:47] (03PS4) 10Matthias Mullie: Enable Extension:MediaSearch on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) [10:52:11] (03CR) 10Mvolz: [C: 03+2] Bump chart version to use new crossref e-mail [deployment-charts] - 10https://gerrit.wikimedia.org/r/685751 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [10:53:26] (03Merged) 10jenkins-bot: Bump chart version to use new crossref e-mail [deployment-charts] - 10https://gerrit.wikimedia.org/r/685751 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [10:56:43] (03CR) 10Jbond: Add python_deploy::venv class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [10:58:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Repool db1110', diff saved to https://phabricator.wikimedia.org/P15822 and previous config saved to /var/cache/conftool/dbconfig/20210506-105850-root.json [10:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P15823 and previous config saved to /var/cache/conftool/dbconfig/20210506-105909-marostegui.json 
[10:59:14] (03PS1) 10Matthias Mullie: Enable Extension:MediaSearch on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685752 (https://phabricator.wikimedia.org/T265939) [10:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:28] !log upgrading spicerack on cumin hosts to 0.0.51-1 [10:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] o/ [11:00:05] Amir1, Lucas_WMDE, apergos, and duesen: Time to snap out of that daydream and deploy EU Backport and Config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1100). [11:00:05] matthiasmullie and CFisch_WMDE: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:23] good $timezone! [11:00:26] (03CR) 10jerkins-bot: [V: 04-1] Enable Extension:MediaSearch on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685752 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [11:00:26] o/ [11:00:47] I will join the google meet since this is a training window, but no one has signed up so we may not have any takers. [11:01:28] apergos: I think matthiasmullie and CFisch_WMDE can self serve :) [11:01:39] since there is no one else here, that's fine [11:02:00] matthiasmullie: I left a comment on your second patch in the window too, I might have misread what was happening though, have a look [11:02:02] (03CR) 10Jbond: Add python_deploy::venv class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [11:02:23] +1 [11:02:35] apergos: you got that all right! [11:02:40] :-) [11:03:26] I was hoping to enable this (new) extension today (it's not in extension-list etc) [11:03:38] oh ho [11:03:41] what does that procedure look like? 
anything specific I should know about? [11:04:05] the one thing I would be cautious about is having multiple config files updated in a single patch UNLESS [11:04:17] you are 100% sure that the order the files land does not make a difference [11:04:23] does a simple sync-file suffice? [11:04:34] and i.e. if one hits first there won't be 'undefined variable' or something [11:04:39] yeah, the order matters, I would like to do them in sequence so that I can test them before proceeding [11:04:56] beta -> testcommons -> commons [11:04:57] if the order of two files in a single patch matters, you need to split them up [11:05:21] (also in doubt about whether to even do prod today, given that it's not yet on beta) [11:05:23] otherwise, if the order of the patches matters, well, you're self-serving, so... [11:05:24] :-) [11:05:31] if you're in doubt, wait for the next window [11:05:42] or early next week I guess it is [11:05:47] cool, will do beta only then [11:05:59] 👍 [11:06:18] at the end just remove the patches from the window in the calendar that didn't go, and make sure the beta one is in there :-) [11:07:06] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10Aklapper) [11:07:14] (03CR) 10Jbond: Add python_deploy::venv class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [11:07:29] apergos: will do [11:08:12] CFisch_WMDE: you can go first with your patch [11:08:23] matthiasmullie: k [11:08:31] (03PS1) 10Kormat: install_server: switch db1173 to buster [puppet] - 10https://gerrit.wikimedia.org/r/685753 (https://phabricator.wikimedia.org/T280751) [11:08:33] (03PS4) 10WMDE-Fisch: Enable ReferencePreviews as full default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) [11:09:29] (03CR) 10WMDE-Fisch: [C: 03+2] "Deployment" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [11:09:36] (03PS1) 10Kormat: db1173: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/685754 (https://phabricator.wikimedia.org/T280751) [11:10:13] (03Merged) 10jenkins-bot: Enable ReferencePreviews as full default on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [11:10:42] (03CR) 10Kormat: [C: 03+2] install_server: switch db1173 to buster [puppet] - 10https://gerrit.wikimedia.org/r/685753 (https://phabricator.wikimedia.org/T280751) (owner: 10Kormat) [11:10:54] (03CR) 10Kormat: [C: 03+2] db1173: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/685754 (https://phabricator.wikimedia.org/T280751) (owner: 10Kormat) [11:11:39] * CFisch_WMDE testing on mwdebug [11:11:44] woo hoo! [11:12:25] !log reimaging db1173 to buster T280751 [11:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:33] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [11:12:42] on 1002 is that? [11:12:56] !log kormat@cumin1001 dbctl commit (dc=all): 'db1173 depooling: Reimage to buster T280751', diff saved to https://phabricator.wikimedia.org/P15824 and previous config saved to /var/cache/conftool/dbconfig/20210506-111256-kormat.json [11:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:16] I guess 1001 [11:13:58] 1001 [11:14:11] yeah, I'm watching logstash and saw the scap notices :-) [11:14:22] Seems to work fine :-), moving forward [11:14:45] 👍 [11:16:15] Since it's two files in the config dir it's fine to scap "wmf-config" right? 
[11:16:53] apergos: [11:16:57] (03PS1) 10Jbond: P:pki::client: ensure profile is ensureable [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) [11:17:19] (03PS2) 10Jbond: P:pki::client: ensure profile is ensureable [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) [11:17:48] uh [11:18:04] not sure. Urbanecm what do you think? [11:18:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:18:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29424/console" [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:18:17] or shall I rather scap one file after the other with the same scap message? [11:18:33] if you know that a specific order will be ok, you can absolutely do one file at a time [11:18:40] (03PS2) 10Matthias Mullie: Enable Extension:MediaSearch on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685752 (https://phabricator.wikimedia.org/T265939) [11:18:49] (03CR) 10jerkins-bot: [V: 04-1] P:pki::client: ensure profile is ensureable [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:19:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:19:50] just waiting a few secs for Urbanecm to catch up :-D [11:21:17] ok i'll do one at a time. 
no dependency there and it feels slightly safer like that ^^' [11:21:22] it is safer [11:21:35] in fact our deployment guide expects you to do it that way [11:21:44] +1 [11:22:30] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:685554|Enable ReferencePreviews as full default on pilot wikis (T271206)]] (duration: 01m 06s) [11:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:38] T271206: Enable RefPreviews on first wikis - https://phabricator.wikimedia.org/T271206 [11:22:52] (03CR) 10Elukey: [C: 03+1] Remove references to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) (owner: 10Hnowlan) [11:23:52] !log wmde-fisch@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:685554|Enable ReferencePreviews as full default on pilot wikis (T271206)]] (duration: 01m 06s) [11:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:39] Done and seems to be working :-) [11:25:42] \o/ [11:25:51] I'm watching logstash and it seems quiet [11:26:00] the last time this patch went out, there was some issue? [11:26:17] if that's right, how long did you have to wait for the problem to be noticeable? [11:26:30] s/this patch/a related patch/ [11:27:18] (03PS3) 10Hnowlan: Remove references to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) [11:27:24] CFisch_WMDE: [11:27:54] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts eventlog1002.eqiad.wmnet [11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:04] apergos: The issue was due to some fault in our code and just lead to the feature not showing. [11:28:12] ah ok [11:28:14] ( there was a config flag set wrongly ) [11:28:30] well things still look ok in the log so [11:28:33] no errors or faults though [11:28:36] w00t and carry on! 
[11:28:47] CFisch_WMDE: you're done? [11:28:53] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts eventlog1002.eqiad.wmnet [11:28:59] matthiasmullie: yes, thanks [11:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:02] (03CR) 10Hnowlan: [C: 03+2] Remove references to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/685746 (https://phabricator.wikimedia.org/T282025) (owner: 10Hnowlan) [11:29:12] CFisch_WMDE: thanks - I'll move forward [11:29:44] gre3at [11:29:48] s/3// [11:29:58] (03PS3) 10Matthias Mullie: Enable Extension:MediaSearch on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685752 (https://phabricator.wikimedia.org/T265939) [11:30:08] (03CR) 10Matthias Mullie: [C: 03+2] Enable Extension:MediaSearch on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685752 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [11:30:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts eventlog1002.eqiad.wmnet [11:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:57] (03Merged) 10jenkins-bot: Enable Extension:MediaSearch on betacommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685752 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [11:31:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: REIMAGE [11:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:03] (03PS3) 10Jbond: P:pki::client: ensure profile is ensureable [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) [11:34:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: REIMAGE [11:34:05] matthiasmullie: the order of deployment of these files in the patch is going to 
matter, I think [11:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:13] !log mlitn@deploy1002 sync-file aborted: Config: [[gerrit:685752|Enable Extension:MediaSearch on betacommons (T265939)]] (duration: 00m 56s) [11:34:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29425/console" [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:21] T265939: Split MediaSearch out into its own extension - https://phabricator.wikimedia.org/T265939 [11:35:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: ensure profile is ensureable [puppet] - 10https://gerrit.wikimedia.org/r/685755 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:35:22] !log mlitn@deploy1002 Synchronized wmf-config: Config: [[gerrit:685752|Enable Extension:MediaSearch on betacommons (T265939)]] (duration: 01m 06s) [11:35:28] you need the variable to be defined before it's referenced in an if statement, which means both InitialiseSettings files need to go around before CommonSettings [11:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:06] apergos: you're right, and failed to realize that in time [11:37:13] and I* failed.. [11:37:29] I think that's just a notice which gets absorbed, though? [11:37:36] this is why we like files to be in separate patches, though it's a bit tedious [11:38:03] yup yup. rule of thumb is IS.php -> CS.php [11:39:02] I am not sure, these might wind up as exceptions turning into 500s. it would only be a tiny blip in the present case but still [11:39:25] keep in mind for next time. we'll see you back here early next week for more :-) [11:40:41] will probably be late next week [11:40:47] okey dokey [11:40:49] but yes, will keep an extra eye out for that!
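(editor's note) The "IS.php -> CS.php" rule of thumb from the exchange above can be sketched as a small ordering helper. This is only an illustration, not part of scap or any Wikimedia tooling: the function name `sync_order` and the `-labs` file names are hypothetical, and the sketch assumes the only ordering constraint is that InitialiseSettings files (which define a variable) land before CommonSettings files (which reference it).

```python
def sync_order(changed_files):
    """Order config files so definitions are synced before the code that reads them.

    Hypothetical helper: it only sorts paths and does not call any deploy tool.
    """
    def rank(path):
        name = path.rsplit("/", 1)[-1]
        if name.startswith("InitialiseSettings"):
            return 0  # defines the setting, so it goes first
        if name.startswith("CommonSettings"):
            return 2  # references the setting, so it goes last
        return 1      # anything else stays in between

    # sorted() is stable, so files with the same rank keep their given order
    return sorted(changed_files, key=rank)


if __name__ == "__main__":
    # File names below are assumed for illustration only.
    for path in sync_order([
        "wmf-config/CommonSettings-labs.php",
        "wmf-config/InitialiseSettings.php",
        "wmf-config/InitialiseSettings-labs.php",
    ]):
        print(path)
```

Each file in the returned list would then be synced and checked one at a time, as discussed earlier in the window, rather than syncing the whole wmf-config directory in one go.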
[11:40:51] Thanks :) [11:40:57] if it's during this window I'll probably see you there/here! [11:41:14] I'm watching logstash and it's fine, of course this is only on beta so [11:41:19] (03PS1) 10Jbond: P:pki::client: explicitly include P:pki::clinet [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) [11:41:45] I've reclaimed my cpu cycles and closed the google meet tab since clearly no one is showing up for the deployment training this time [11:42:54] (03CR) 10Muehlenhoff: [C: 03+2] openldap: Remove python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/685725 (owner: 10Muehlenhoff) [11:43:02] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685741 (owner: 10Muehlenhoff) [11:43:28] (03CR) 10jerkins-bot: [V: 04-1] P:pki::client: explicitly include P:pki::clinet [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:43:47] well... that's the end of today's backport window unless, Urbanecm, you want to sneak something in :-D [11:43:58] (03PS2) 10Jbond: P:pki::client: explicitly include P:pki::clinet [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) [11:43:59] apergos: it's really not only beta. CS.php was changed, which is loaded for every single request. Maybe we were lucky and rsync synced it in the correct order, but otherwise we'd have few thousands of exceptions :) [11:44:01] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts eventlog1002.eqiad.wmnet [11:44:04] apergos: nope, not this time, thank you :) [11:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:31] Urbanecm: yes that would be about the blip. 
I mean after the recovery from the blip, we're not expecting any impact on the production wikis [11:44:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29427/console" [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [11:44:53] whatever was to be seen, has been seen, in other words [11:45:45] yeah, it's not permanent [11:46:14] (03PS1) 10Hnowlan: wmnet: correct eventlogging CNAME [dns] - 10https://gerrit.wikimedia.org/r/685757 (https://phabricator.wikimedia.org/T278137) [11:46:35] (03CR) 10Hnowlan: "Is this record still required? I see little reference to it in git." [dns] - 10https://gerrit.wikimedia.org/r/685757 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [11:47:53] I guess I'll wander off in search of smoothie then, it's about that time. [11:48:14] thanks for being around to fill in the gaps! [11:50:57] (03PS1) 10Jbond: cloud - pki: add deployment prep intermidate [puppet] - 10https://gerrit.wikimedia.org/r/685758 [11:51:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] cloud - pki: add deployment prep intermidate [puppet] - 10https://gerrit.wikimedia.org/r/685758 (owner: 10Jbond) [11:51:51] (03CR) 10Elukey: [C: 03+1] "I don't think it is used but for the moment I'd be in favor of just fixing it, it will then fully deprecated when eventlogging will be dec" [dns] - 10https://gerrit.wikimedia.org/r/685757 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [11:53:56] (03PS1) 10Muehlenhoff: Don't install python-etcd starting with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/685759 [11:54:20] (03CR) 10Jgiannelos: [C: 04-1] "Building the image locally raises this error:" [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [11:54:44] (03PS4) 10Volans: Add python_deploy::venv class [puppet] - 10https://gerrit.wikimedia.org/r/685537 
(https://phabricator.wikimedia.org/T276589) [11:55:07] (03CR) 10Volans: "Addressed comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [11:58:01] (03CR) 10Jgiannelos: [C: 04-1] "I think `$G` in `$GOPATH` should be escaped." [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [12:02:33] (03CR) 10Jgiannelos: [C: 04-1] "Even with that escaped Makefile fails with:" [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [12:13:33] (03CR) 10Hnowlan: [C: 03+2] wmnet: correct eventlogging CNAME [dns] - 10https://gerrit.wikimedia.org/r/685757 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [12:14:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I think we can also just remove python-etcd at this point, but let's do it explicitly after some testing." [puppet] - 10https://gerrit.wikimedia.org/r/685759 (owner: 10Muehlenhoff) [12:17:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:18:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, two typos inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [12:20:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:22:05] (03PS5) 10Volans: Add python_deploy::venv class [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) [12:22:14] (03CR) 10Volans: "Thanks, fixed comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [12:22:25] (03PS1) 10Elukey: hadoop: force Yarn to use DominantResourceCalculator [puppet] - 10https://gerrit.wikimedia.org/r/685762 (https://phabricator.wikimedia.org/T281792) [12:22:45] joal: --^ [12:23:46] (03CR) 10Joal: [C: 03+1] "Thanks Luca :)" [puppet] - 10https://gerrit.wikimedia.org/r/685762 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [12:24:31] (03CR) 10Elukey: [C: 03+2] hadoop: force Yarn to use DominantResourceCalculator [puppet] - 10https://gerrit.wikimedia.org/r/685762 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [12:29:18] 10SRE, 10Wikimedia-Logstash, 10observability: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) [12:29:26] 10SRE, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10fgiunchedi) [12:30:02] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [12:33:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29428/console" [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [12:35:49] (03PS1) 10Muehlenhoff: etcd::client::globalconfig: Remove python-etcd [puppet] - 10https://gerrit.wikimedia.org/r/685766 [12:35:53] (03PS3) 10Jbond: P:pki::client: explicitly include P:pki::clinet 
[puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) [12:36:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29429/console" [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [12:40:47] (03PS4) 10Jbond: P:pki::client: explicitly include P:pki::clinet [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) [12:40:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29430/console" [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [12:45:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: explicitly include P:pki::clinet [puppet] - 10https://gerrit.wikimedia.org/r/685756 (https://phabricator.wikimedia.org/T281369) (owner: 10Jbond) [12:50:00] (03CR) 10Jgiannelos: [C: 04-1] "Just removing the `GEOM_HASH` references should be fine for the metrics." 
[software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [12:50:40] (03PS1) 10Filippo Giunchedi: alertmanager: route noc paging alerts to SRE batphone [puppet] - 10https://gerrit.wikimedia.org/r/685778 (https://phabricator.wikimedia.org/T281095) [12:50:42] (03PS1) 10Filippo Giunchedi: icinga: switch to LibreNMS AlertManager paging [puppet] - 10https://gerrit.wikimedia.org/r/685779 (https://phabricator.wikimedia.org/T281095) [12:52:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 25%: Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P15825 and previous config saved to /var/cache/conftool/dbconfig/20210506-125226-root.json [12:52:28] (03CR) 10Filippo Giunchedi: "Note this won't be live yet; post-merge we'll add the paging transports to appropriate alerts in librenms" [puppet] - 10https://gerrit.wikimedia.org/r/685778 (https://phabricator.wikimedia.org/T281095) (owner: 10Filippo Giunchedi) [12:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:24] (03CR) 10Tonina Zhelyazkova: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685776 (https://phabricator.wikimedia.org/T241422) (owner: 10Tonina Zhelyazkova) [12:59:26] (03CR) 10Zabe: Disabling Education Program extension in ru.wiki per T282112 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (owner: 10Rubin) [13:00:04] brennen and liw: May I have your attention please! MediaWiki train - American+European Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1300) [13:02:42] (03CR) 10Muehlenhoff: [C: 03+2] conftool::client: Remove python-socks [puppet] - 10https://gerrit.wikimedia.org/r/685741 (owner: 10Muehlenhoff) [13:06:48] (03PS2) 10Jgiannelos: build: add build info flags [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [13:07:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [13:07:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 50%: Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P15826 and previous config saved to /var/cache/conftool/dbconfig/20210506-130730-root.json [13:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:11] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:08:12] (03CR) 10Elukey: [V: 03+1] bigtop::hadoop::nodemanager: apply systemd override to service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [13:08:15] (03CR) 10Jgiannelos: "@MSantos I amended the commit without the `GEOM_HASH` (which is not required in the first place)." 
[software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [13:08:29] (03CR) 10Rubin: "thank you" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (owner: 10Rubin) [13:08:35] (03PS5) 10Elukey: bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) [13:08:44] (03PS3) 10Rubin: Disabling Education Program extension in Russian Wikipedia Bug: T282112 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) [13:09:47] (03CR) 10MSantos: [C: 03+1] build: add build info flags [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [13:10:07] (03PS4) 10Rubin: Disabling Education Program extension in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) [13:10:33] (03CR) 10Elukey: [C: 03+2] bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [13:10:41] (03CR) 10Joal: [C: 03+1] "LGTM (even if I have no clue what I'm talking about!)" [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [13:10:43] (03CR) 10Jgiannelos: [C: 03+2] build: add build info flags [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [13:11:42] (03Merged) 10jenkins-bot: build: add build info flags [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 (owner: 10MSantos) [13:16:18] (03PS1) 10Elukey: bigtop::hadoop::nodemanager: use double quotes for override content [puppet] - 10https://gerrit.wikimedia.org/r/685787 [13:17:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29432/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/685787 (owner: 10Elukey) [13:17:25] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Match gunicorn status code in statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/685788 [13:17:29] (03CR) 10Elukey: [V: 03+1 C: 03+2] bigtop::hadoop::nodemanager: use double quotes for override content [puppet] - 10https://gerrit.wikimedia.org/r/685787 (owner: 10Elukey) [13:20:00] 10SRE, 10Commons, 10MediaWiki-API: Frequent 504s while using logevents API on Commons - https://phabricator.wikimedia.org/T282122 (10AntiCompositeNumber) [13:21:20] !log push pfw policies - T281942 [13:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:30] (03PS1) 10Volans: icinga: support verbatim hosts in icinga-status [puppet] - 10https://gerrit.wikimedia.org/r/685789 [13:22:20] (03PS1) 10Elukey: bigtop::hadoop::nodemanager: use the correct systemd setting [puppet] - 10https://gerrit.wikimedia.org/r/685790 [13:22:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 75%: Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P15827 and previous config saved to /var/cache/conftool/dbconfig/20210506-132234-root.json [13:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:08] (03CR) 10Elukey: [C: 03+2] bigtop::hadoop::nodemanager: use the correct systemd setting [puppet] - 10https://gerrit.wikimedia.org/r/685790 (owner: 10Elukey) [13:24:36] (03CR) 10Marostegui: [C: 03+1] "thanks!" 
[software] - 10https://gerrit.wikimedia.org/r/685524 (owner: 10Jcrespo) [13:36:29] (03PS1) 10Jdrewniak: Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685795 (https://phabricator.wikimedia.org/T281968) [13:37:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 100%: Repool db1144:3315', diff saved to https://phabricator.wikimedia.org/P15828 and previous config saved to /var/cache/conftool/dbconfig/20210506-133738-root.json [13:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:39] (03PS1) 10Jgiannelos: Maps vector server PostGIS query improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/685799 [13:46:35] (03PS1) 10Jbond: P:wikidough: Add tcp connect checks for DoH and DTLS [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) [13:46:54] 10SRE, 10Commons, 10MediaWiki-API: Frequent 504s while using logevents API on Commons - https://phabricator.wikimedia.org/T282122 (10Rubin16) p:05Triage→03High [13:47:03] (03CR) 10Jgiannelos: [C: 04-1] "Blocking this for now because its work in progress" [deployment-charts] - 10https://gerrit.wikimedia.org/r/685799 (owner: 10Jgiannelos) [13:48:10] (03CR) 10jerkins-bot: [V: 04-1] P:wikidough: Add tcp connect checks for DoH and DTLS [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [13:48:28] (03PS2) 10Jbond: P:wikidough: Add tcp connect checks for DoH and DTLS [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) [13:49:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29436/console" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [13:49:53] (03CR) 10Ssingh: "Abandoning in favour of 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/685800/." [puppet] - 10https://gerrit.wikimedia.org/r/685030 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:50:03] (03Abandoned) 10Ssingh: wikidough: add nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/685030 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:54:18] (03CR) 10Ssingh: "Thanks Moritz, for the very detailed reply! I learned a lot..." [puppet] - 10https://gerrit.wikimedia.org/r/685515 (owner: 10Ssingh) [13:57:18] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:00:11] (03CR) 10Muehlenhoff: [C: 03+2] Don't install python-etcd starting with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/685759 (owner: 10Muehlenhoff) [14:02:10] (03PS1) 10Muehlenhoff: Remove obsolete tlsproxy::ocsp and related configs [puppet] - 10https://gerrit.wikimedia.org/r/685810 [14:02:12] (03PS1) 10Muehlenhoff: Remove obsolete profile cache:ssl:unified [puppet] - 10https://gerrit.wikimedia.org/r/685811 [14:04:32] (03PS1) 10Volans: icinga: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/685812 [14:04:34] (03PS1) 10Volans: netbox: fix check for server role [software/spicerack] - 10https://gerrit.wikimedia.org/r/685813 [14:04:36] (03PS1) 10Volans: icinga: pass verbatim_hosts option to icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/685814 [14:05:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:09:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3315 for schema change', diff saved to https://phabricator.wikimedia.org/P15829 and previous config saved to /var/cache/conftool/dbconfig/20210506-140916-marostegui.json [14:09:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:21] (03PS1) 10Muehlenhoff: Add cumin2002 as cumin master and allow for tcpircbot and ganeti/rapi [puppet] - 10https://gerrit.wikimedia.org/r/685817 (https://phabricator.wikimedia.org/T276589) [14:12:20] (03CR) 10Volans: [C: 03+1] "LGTM, I can't tell by memory if there is any other places. I think there might be some MySQL grant though." [puppet] - 10https://gerrit.wikimedia.org/r/685817 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [14:13:51] PROBLEM - Host dbprov2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:31] (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/685817 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [14:18:00] (03PS2) 10Jsn.sherman: labs: Enable TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) [14:18:38] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/685515 (owner: 10Ssingh) [14:18:51] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [14:19:02] (03CR) 10Muehlenhoff: [C: 03+2] Add cumin2002 as cumin master and allow for tcpircbot and ganeti/rapi [puppet] - 10https://gerrit.wikimedia.org/r/685817 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [14:19:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/685814 (owner: 10Volans) [14:20:08] (03CR) 10Volans: [C: 03+2] "docstring typo, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/685812 (owner: 10Volans) [14:20:24] (03CR) 10CDanis: [C: 03+1] alertmanager: route noc paging alerts to SRE batphone [puppet] - 10https://gerrit.wikimedia.org/r/685778 (https://phabricator.wikimedia.org/T281095) (owner: 10Filippo Giunchedi) [14:21:32] (03PS3) 10Ssingh: P:wikidough: Add TCP connect check for DoH [puppet] - 
10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [14:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3315', diff saved to https://phabricator.wikimedia.org/P15833 and previous config saved to /var/cache/conftool/dbconfig/20210506-143002-marostegui.json [14:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:52] (03CR) 10Jbond: [C: 03+1] icinga: support verbatim hosts in icinga-status [puppet] - 10https://gerrit.wikimedia.org/r/685789 (owner: 10Volans) [14:40:02] (03PS1) 10Ssingh: Revert "package_builder: add python3-yaml" [puppet] - 10https://gerrit.wikimedia.org/r/685602 [14:40:06] (03CR) 10Scardenasmolinar: [C: 03+1] labs: Enable TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) (owner: 10Jsn.sherman) [14:40:18] (03Merged) 10jenkins-bot: icinga: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/685812 (owner: 10Volans) [14:40:21] (03PS3) 10Jsn.sherman: labs: Enable TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) [14:40:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29437/console" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [14:46:33] RECOVERY - Host dbprov2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.94 ms [14:47:11] (03PS1) 10WMDE-Fisch: Enable ReferencePreviews as full default on Marathi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685820 (https://phabricator.wikimedia.org/T282147) [14:48:52] (03CR) 10Jbond: "drop export and assum +1 from me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [14:51:07] (03PS6) 10Volans: Add 
python_deploy::venv class [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) [14:51:43] (03PS1) 10Ahmon Dancy: Make $wgGEDatabaseCluster default to false in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685822 [14:51:57] (03CR) 10Volans: "addressed comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [14:52:24] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [14:52:56] jynus: dbprov2002 is back online [14:53:06] thank you very much, papaul ! [14:53:12] (03CR) 10jerkins-bot: [V: 04-1] Make $wgGEDatabaseCluster default to false in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685822 (owner: 10Ahmon Dancy) [14:54:42] (03PS1) 10Ssingh: nagios_common: add check_tcp_ssl [puppet] - 10https://gerrit.wikimedia.org/r/685823 (https://phabricator.wikimedia.org/T252132) [14:55:34] !log powerdown kafka-main2002 for relocation [14:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [14:56:19] (03PS2) 10Ahmon Dancy: Make $wgGEDatabaseCluster default to false in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685822 [14:57:37] PROBLEM - Host kafka-main2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:33] (03CR) 10Ahmon Dancy: [C: 03+2] Make $wgGEDatabaseCluster default to false in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685822 (owner: 10Ahmon Dancy) [14:58:48] (03CR) 10Ahmon Dancy: [V: 03+2 C: 03+2] Make $wgGEDatabaseCluster default to false in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685822 (owner: 10Ahmon 
Dancy) [15:00:35] PROBLEM - Host kafka-main2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:04:52] had an internet fault during my deploy window earlier today so heads up I'm going to be using this free hour in the schedule to finish up. Provided it doesn't drop again. >.< [15:05:18] !log imported wmfmariadbpy 0.6+deb11u1 for bullseye-wikimedia to apt.wikimedia.org [15:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:04] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [15:06:10] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 105 hosts with reason: T270704 [15:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:19] T270704: cloud: introduce new edge network architecture for eqiad1 and codfw1dev - https://phabricator.wikimedia.org/T270704 [15:06:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 105 hosts with reason: T270704 [15:06:54] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: T270704 [15:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:57] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: T270704 [15:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:00] !log imported wmfbackups 0.5+deb11u1 for bullseye-wikimedia to apt.wikimedia.org [15:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:25] RECOVERY - Host kafka-main2002 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:13:47] RECOVERY - Host kafka-main2002.mgmt is UP: PING OK - Packet loss = 0%, 
RTA = 33.70 ms [15:14:54] !log powerdown ms-be2053 for relocation [15:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:09] (03PS1) 10Muehlenhoff: Unconditionally install spicerack from "main" [puppet] - 10https://gerrit.wikimedia.org/r/685825 [15:16:26] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable ReferencePreviews as full default on Marathi wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685820 (https://phabricator.wikimedia.org/T282147) (owner: 10WMDE-Fisch) [15:16:44] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [15:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:14] (03Restored) 10Ahmon Dancy: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (owner: 10Ahmon Dancy) [15:18:20] (03PS2) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [15:18:37] PROBLEM - Host ms-be2053 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:05] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [15:20:11] (03PS3) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [15:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:21] moritzm: is that okay? I can stop if it will be a problem? I have one more service to redeploy. [15:26:04] !log T280382 `wdqs1007.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 998G 1.5T 40% /srv` [15:26:10] !log T280382 `wdqs2004.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. 
`df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 998G 1.5T 40% /srv` [15:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:13] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [15:26:15] !log T280382 [WDQS] Pooled `wdqs1007` and `wdqs2004` [15:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:34] (03CR) 10Mvolz: [C: 03+2] Update Zotero to use new email for crossRef [deployment-charts] - 10https://gerrit.wikimedia.org/r/685747 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [15:27:43] PROBLEM - Host ms-be2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:27:53] (03PS4) 10Arturo Borrero Gonzalez: openstack: neutron: topology changes for cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) [15:27:56] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: update names for cloudgw migration [dns] - 10https://gerrit.wikimedia.org/r/684864 (https://phabricator.wikimedia.org/T270704) [15:28:43] (03CR) 10Jbond: [C: 04-1] "see comment" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [15:28:49] (03Merged) 10jenkins-bot: Update Zotero to use new email for crossRef [deployment-charts] - 10https://gerrit.wikimedia.org/r/685747 (https://phabricator.wikimedia.org/T278516) (owner: 10Mvolz) [15:29:01] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin1001.eqiad.wmnet with reason: quiz [15:29:01] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin1001.eqiad.wmnet with reason: quiz [15:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:16] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [15:29:49] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:31:13] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [15:31:15] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] (03PS2) 10Giuseppe Lavagetto: Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 [15:31:48] (03PS4) 10Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [15:31:50] (03PS4) 10Giuseppe Lavagetto: eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 [15:31:51] RECOVERY - Host ms-be2053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.34 ms [15:32:03] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2003.codfw.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [15:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:10] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [15:32:12] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1012.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [15:32:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: topology changes for cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/683268 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [15:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:02] 
(03PS3) 10Giuseppe Lavagetto: Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 [15:33:04] (03PS5) 10Giuseppe Lavagetto: eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [15:33:06] (03PS5) 10Giuseppe Lavagetto: eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 [15:33:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: update names for cloudgw migration [dns] - 10https://gerrit.wikimedia.org/r/684864 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [15:33:11] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [15:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:18] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [15:34:07] !log push cloud-gw-transport-eqiad to asw2-b-eqiad and cloudsw [15:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:33] (03CR) 10jerkins-bot: [V: 04-1] eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [15:34:39] (03CR) 10jerkins-bot: [V: 04-1] eventgate-main: autogenerate egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/684856 (owner: 10Giuseppe Lavagetto) [15:35:03] (03CR) 10jerkins-bot: [V: 04-1] Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 (owner: 10Giuseppe Lavagetto) [15:53:24] 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10Pchelolo) It's possible 
it has not been noticed. RESTBase rate-limiting is done per-host, and then per-host counters are distributed across all the hosts over UDP via a distributed hash table. So... [15:54:03] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/29439/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/685823 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:54:10] (03CR) 10Ssingh: [C: 03+2] nagios_common: add check_tcp_ssl [puppet] - 10https://gerrit.wikimedia.org/r/685823 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:54:29] RECOVERY - Host logstash2027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [15:54:37] (03PS2) 10Ottomata: Declare WikidataCompletionSearchClicks stream and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685836 (https://phabricator.wikimedia.org/T282140) [15:55:11] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:55:13] (03CR) 10CRusnov: [C: 03+2] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov) [15:55:23] (03PS35) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) [15:55:35] (03PS4) 10Ssingh: P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [15:56:56] (03PS1) 10Andrew Bogott: nfs-exportd: include floating IPs in allowed list [puppet] - 10https://gerrit.wikimedia.org/r/685837 [15:58:26] !log starting upgrade of public mailing lists in group d and e (T280322) [15:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:35] T280322: Upgrade mailing lists from 
mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 [15:59:06] Amir1: any way to delay cloud-* until our current maintenance is done? [15:59:26] let me see if it's in the group [15:59:28] particularly -announce which we would like to have access to during the maintenance [15:59:31] I think it's in d [16:00:04] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1600). [16:00:51] (03CR) 10Volans: [C: 03+2] Add python_deploy::venv class [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [16:01:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:01:41] we can survive if it's too much trouble, also not sure how long it makes the list unusable so it might not be that big of a problem [16:02:17] stopped [16:02:30] I remove clouds and start again [16:02:51] thanks! 
[16:03:04] I'll ping you when we're done, and sorry for the trouble [16:03:16] (03PS1) 10Volans: python_deploy::venv: fix typo in path [puppet] - 10https://gerrit.wikimedia.org/r/685840 [16:03:25] all good [16:03:44] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd: include floating IPs in allowed list [puppet] - 10https://gerrit.wikimedia.org/r/685837 (owner: 10Andrew Bogott) [16:04:16] (03PS1) 10Andrew Bogott: nfs-exportd: remove some old code for nova-network handling [puppet] - 10https://gerrit.wikimedia.org/r/685842 [16:06:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nfs-exportd: remove some old code for nova-network handling [puppet] - 10https://gerrit.wikimedia.org/r/685842 (owner: 10Andrew Bogott) [16:06:07] (03CR) 10Volans: "LGTM, but I've a question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685825 (owner: 10Muehlenhoff) [16:06:24] (03PS1) 10Jgiannelos: Deploy chromium-render version 2021-05-04-135833-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/685843 [16:08:03] (03PS5) 10Ssingh: P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:08:14] (03CR) 10Volans: [C: 03+2] python_deploy::venv: fix typo in path [puppet] - 10https://gerrit.wikimedia.org/r/685840 (owner: 10Volans) [16:08:18] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd: remove some old code for nova-network handling [puppet] - 10https://gerrit.wikimedia.org/r/685842 (owner: 10Andrew Bogott) [16:09:34] !log [Elastic] Set `elastic2058` as the only banned node in Cirrussearch Elasticsearch clusters (`elastic2058-production-search-codfw`, `elastic2058-production-search-omega-codfw`, `elastic2058-production-search-psi-codfw`) [16:09:37] RECOVERY - Host logstash2027 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [16:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:39] PROBLEM - Check systemd state on 
logstash2027 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:05] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29441/console" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:12:02] !log powerdown mc-gp2002 for relocation [16:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:20] (03CR) 10Jbond: P:wikidough: Add TCP connect check for DoH and DoT (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:13:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/685405 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [16:14:23] PROBLEM - Host mc-gp2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:27] (03PS2) 10Volans: icinga: pass verbatim_hosts option to icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/685814 [16:14:45] Amir1: we're done with cloud maintenance, feel free to migrate those lists as well now [16:15:17] yeah but now they have to wait until the whole thing is done (~five hours) [16:17:23] PROBLEM - Host mc-gp2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:32] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Rename mailinglists eliso, and eliso-anoncoj - https://phabricator.wikimedia.org/T281686 (10KuboF) Great, thanks! I have checked it and everything seems to work well! 
[16:23:20] (03CR) 10Herron: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [16:23:25] RECOVERY - Host mc-gp2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:24:55] i am done done with 2043 yet still [16:25:01] ryankemper: [16:25:32] ack, will ban it from the cluster as well [16:26:05] papaul: just to confirm that should read “not done” right? [16:26:19] ryankemper: not done [16:26:53] (03CR) 10Volans: [C: 03+2] icinga: pass verbatim_hosts option to icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/685814 (owner: 10Volans) [16:31:21] (03PS1) 10Vgutierrez: trafficserver: Clear outbound TLS cacert_path for ats@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/685851 (https://phabricator.wikimedia.org/T281673) [16:32:15] RECOVERY - Check systemd state on logstash2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:30] (03Merged) 10jenkins-bot: icinga: pass verbatim_hosts option to icinga-status [software/spicerack] - 10https://gerrit.wikimedia.org/r/685814 (owner: 10Volans) [16:33:52] (03PS2) 10Volans: netbox: fix check for server role [software/spicerack] - 10https://gerrit.wikimedia.org/r/685813 [16:34:33] (03PS1) 10Addshore: Wikibase: Use wikidataclient dblist directly for repo localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685852 (https://phabricator.wikimedia.org/T282160) [16:34:48] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29442/console" [puppet] - 10https://gerrit.wikimedia.org/r/685851 (https://phabricator.wikimedia.org/T281673) (owner: 10Vgutierrez) [16:36:09] (03PS1) 10Addshore: Wikibase: Use wikidataclient-test dblist for testwikidata localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685853 
(https://phabricator.wikimedia.org/T282160) [16:36:46] (03PS6) 10Ssingh: P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:38:12] jouncebot: now [16:38:12] For the next 1 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1600) [16:38:16] jouncebot: next [16:38:16] In 0 hour(s) and 21 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1700) [16:39:33] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: introduce icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) [16:39:45] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29443/console" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:40:22] (03CR) 10Ayounsi: [C: 03+1] netbox: fix check for server role [software/spicerack] - 10https://gerrit.wikimedia.org/r/685813 (owner: 10Volans) [16:40:43] (03CR) 10Volans: [C: 03+2] netbox: fix check for server role [software/spicerack] - 10https://gerrit.wikimedia.org/r/685813 (owner: 10Volans) [16:41:23] (03PS1) 10Volans: sre.deploy.python-code: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 [16:42:01] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Clear outbound TLS cacert_path for ats@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/685851 (https://phabricator.wikimedia.org/T281673) (owner: 10Vgutierrez) [16:42:44] oh wow, i just saw https://deploy-commands.toolforge.org/ [16:42:57] nice [16:42:57] 10SRE, 10SRE-Access-Requests: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (10Dzahn) Hello [[ Olja | @odimitrijevic ]], the noc@ account is an email 
forwarder that sends mail to all (SRE) root users but the real name for it was set to Nuria.... [16:42:59] (03CR) 10Ssingh: [V: 03+1] "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [16:43:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: introduce icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [16:43:53] !log Enforce Puppet Internal CA validation on trafficserver@ulsfo - T281673 [16:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] addshore: :D [16:44:33] Did you try to report problems? [16:44:39] nope? :O [16:44:41] (03CR) 10Cwhite: "> Agree, that would be nice to do. What do you think might be a good approach to handle it without disrupting deployment-prep?" [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [16:44:59] addshore: Can you try? I want to see if it works as expected [16:45:06] how? where? [16:45:17] ohn report issues at the bottom? [16:45:20] oh lol.... 
[16:45:24] >.> [16:45:27] RECOVERY - Host mc-gp2002 is UP: PING OK - Packet loss = 0%, RTA = 31.91 ms [16:45:35] no one was taking my bait [16:45:41] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [16:46:16] (03Merged) 10jenkins-bot: netbox: fix check for server role [software/spicerack] - 10https://gerrit.wikimedia.org/r/685813 (owner: 10Volans) [16:46:49] PROBLEM - Check whether ferm is active by checking the default input chain on mc-gp2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:48:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:48:23] 10SRE, 10SRE-Access-Requests: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (10Dzahn) a:05Dzahn→03odimitrijevic Let me know if you got that email and can use the service. Cheers [16:48:23] Amir1! [16:48:34] wassup? [16:49:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:49:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:50:46] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:51:01] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.52 [software/spicerack] - 10https://gerrit.wikimedia.org/r/685860 [16:51:27] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.52 [software/spicerack] - 10https://gerrit.wikimedia.org/r/685860 (owner: 10Volans) [16:52:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:53:31] (03PS2) 10Brennen Bearnes: Fixed a few minor typos in README.md [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [16:55:00] Amir1: that stupid report issues link [16:55:10] :D [16:55:11] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10MSantos) [16:55:18] Sorry [16:56:33] It's ok [16:56:40] I have sound turned down! 
[16:57:08] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.52 [software/spicerack] - 10https://gerrit.wikimedia.org/r/685860 (owner: 10Volans) [16:58:01] (03CR) 10Ahmon Dancy: [C: 04-1] Fixed a few minor typos in README.md (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [16:58:08] You're not the first to try that [16:58:55] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10RKemper) [16:59:49] (03PS1) 10Volans: Upstream release v0.0.52 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/685865 [17:00:04] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1700). [17:00:11] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.52 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/685865 (owner: 10Volans) [17:00:57] !log powerdown elastic2058 for relocation [17:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:06] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:06:27] (03CR) 10Jgiannelos: [C: 03+2] Deploy chromium-render version 2021-05-04-135833-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/685843 (owner: 10Jgiannelos) [17:07:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:07:56] (03Merged) 10jenkins-bot: Deploy chromium-render version 2021-05-04-135833-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/685843 (owner: 10Jgiannelos) [17:08:01] bblack: still waiting on confirmation for cp2033 and cp2034 [17:08:04] (03Merged) 10jenkins-bot: Upstream release v0.0.52 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/685865 (owner: 10Volans) [17:08:43] papaul: not bblack, but bblack is in a meeting right now [17:08:48] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 40 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [17:10:28] i check this [17:10:40] I think it's listadmins [17:11:36] it'll grow a bit [17:11:50] RECOVERY - Host elastic2058 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [17:12:20] (03PS3) 10Brennen Bearnes: Fixed a few minor typos in README.md [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [17:12:28] sukhe: thanks [17:12:59] !log uploaded spicerack_0.0.52 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [17:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:11] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:13:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:24] I did some changes on bounce processing of listadmins [17:13:32] !log powerdown ms-be2057 for relocation [17:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:36] !log [Elastic] Set `elastic2043` as the only banned node in Cirrussearch Elasticsearch clusters (`elastic2058-production-search-codfw`, `elastic2058-production-search-omega-codfw`, `elastic2058-production-search-psi-codfw`) [17:15:40] !log upgrade spicerack on cumin* to 0.0.52 [17:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:54] PROBLEM - Host ms-be2057 is DOWN: PING CRITICAL - Packet loss = 100% [17:17:14] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:18:02] RECOVERY - Check whether ferm is active by checking the default input chain on mc-gp2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:20:08] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:02] PROBLEM - Host ms-be2057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:27:09] (03CR) 10Marostegui: [C: 03+2] parsercachepurging.pp: Reduce parsercache retention to 21 days [puppet] - 10https://gerrit.wikimedia.org/r/685222 (https://phabricator.wikimedia.org/T280605) (owner: 10Marostegui) [17:27:44] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp203[34].codfw.wmnet [17:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:54] papaul: explicitly depooled just in case, but please go ahead at will [17:29:04] bblack: you can leave it for now i have a meeting at 12:30 [17:29:13] (03PS1) 10Jgiannelos: Revert "Deploy chromium-render version 2021-05-04-135833-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/685873 [17:29:22] RECOVERY - Host ms-be2057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [17:29:22] i don't think i can get to it today [17:31:03] (03CR) 10Jgiannelos: [C: 03+2] Revert "Deploy chromium-render version 2021-05-04-135833-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/685873 (owner: 10Jgiannelos) [17:31:18] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [17:31:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:32:22] (03Merged) 10jenkins-bot: Revert "Deploy chromium-render version 2021-05-04-135833-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/685873 (owner: 10Jgiannelos) [17:33:22] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:12] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:35:18] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [17:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:37:02] (03PS4) 10Volans: sre.hosts.remove-downtime: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 [17:37:24] (03CR) 10Jgiannelos: "Reverting after encountering some errors on production deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/685873 (owner: 10Jgiannelos) [17:38:29] (03CR) 10Volans: [C: 03+2] sre.deploy.python-code: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 (owner: 10Volans) [17:38:58] wrong patch, removed +2 [17:41:40] 10SRE, 10Cloud-VPS, 10Traffic, 10HTTPS: certificate for Cloud VPS has expired - https://phabricator.wikimedia.org/T282102 (10Nintendofan885) [17:42:42] (03CR) 10Volans: [C: 03+2] sre.hosts.remove-downtime: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [17:44:53] oh, that's cool, removing downtimes:) [17:45:28] (03Merged) 10jenkins-bot: sre.hosts.remove-downtime: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/681308 (owner: 10Volans) [17:47:32] !log volans@cumin2001 START - Cookbook sre.hosts.remove-downtime for cumin1001.eqiad.wmnet [17:47:32] !log volans@cumin2001 END (FAIL) - Cookbook sre.hosts.remove-downtime (exit_code=99) for cumin1001.eqiad.wmnet [17:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:26] (03PS1) 10Volans: sre.hosts.remove-downtime: fix wrong parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/685876 [17:51:38] (03PS4) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [17:53:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:47] (03CR) 10Volans: [C: 03+2] "fix broken call" [cookbooks] - 10https://gerrit.wikimedia.org/r/685876 (owner: 10Volans) [17:55:06] RECOVERY - Host ms-be2057 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [17:57:29] (03Merged) 10jenkins-bot: sre.hosts.remove-downtime: fix wrong parameter [cookbooks] - 10https://gerrit.wikimedia.org/r/685876 (owner: 10Volans) [17:57:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:13] !log volans@cumin2001 START - Cookbook sre.hosts.remove-downtime for cumin1001.eqiad.wmnet [17:59:13] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cumin1001.eqiad.wmnet [17:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1800) [18:00:04] addshore: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:05:14] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) 05Open→03Resolved @JMeybohm this is now all supported. We have a `sre.hosts.remove-downtime` cookbook that when run with `--force` will ask the user if it wants to pr... [18:15:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:19:56] O/ I wonder if anyone is around to deploy that [18:22:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:22:26] (03PS1) 10Legoktm: mailman3: Fix URLs getting line wrapped [puppet] - 10https://gerrit.wikimedia.org/r/685882 (https://phabricator.wikimedia.org/T282044) [18:27:06] Niharika: Urbanecm I have a config change i'd like to deploy [18:27:18] is backport happening? I see that addshore has a change to deploy? [18:27:41] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10Upstream: Mailman3 "New subscription request to" template line wraps, breaking long links - https://phabricator.wikimedia.org/T282044 (10Legoktm) {F34443980} [18:28:22] addshore: dunno, but I suppose I could deploy them for you? [18:28:28] not sure if that steps on toes [18:28:34] i'm here in case i'm needed [18:28:37] thought addshore will self-serve [18:28:41] ah! [18:28:59] i need to deploy a patch anyway and addshore's look simple enough so I can do them Urbanecm [18:29:02] addshore: yt? [18:29:28] gonna go ahead and do mine... 
[18:29:35] (03CR) 10Ottomata: [C: 03+2] Declare WikidataCompletionSearchClicks stream and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685836 (https://phabricator.wikimedia.org/T282140) (owner: 10Ottomata) [18:30:03] (03CR) 10Legoktm: [C: 03+2] mailman3: Fix URLs getting line wrapped [puppet] - 10https://gerrit.wikimedia.org/r/685882 (https://phabricator.wikimedia.org/T282044) (owner: 10Legoktm) [18:31:46] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Declare WikidataCompletionSearchClicks stream and migrate on testwiki - T282140 (duration: 01m 06s) [18:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:55] T282140: WikidataCompletionSearchClicks Event Platform Migration - https://phabricator.wikimedia.org/T282140 [18:32:05] (03PS1) 10Legoktm: Fix URLs getting line wrapped [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/685883 (https://phabricator.wikimedia.org/T282044) [18:32:58] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Fix URLs getting line wrapped [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/685883 (https://phabricator.wikimedia.org/T282044) (owner: 10Legoktm) [18:34:15] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10Upstream: Mailman3 "New subscription request to" template line wraps, breaking long links - https://phabricator.wikimedia.org/T282044 (10Legoktm) 05Open→03Resolved a:03Legoktm [18:38:15] I didn't see anyone so I went an ate! [18:38:16] Not presently in a situation I can click the buttons myself :) [18:38:16] Poke Urbanecm :) [18:38:30] hello addshore :) [18:38:33] ottomata: are you done? [18:40:02] addshore: sorry, I should ask if you're going to self-serve in the future 🙂 [18:41:28] I should / could make if I am on the calender :D [18:41:56] ottomata:are you done with your deploy? 
[18:42:00] ottomata: ^ [18:42:44] 10SRE, 10Wikimedia-Mailing-lists: Upload new mailman3 and hyperkitty packages - https://phabricator.wikimedia.org/T282092 (10Legoktm) [18:42:47] Urbanecm: yes done [18:42:51] sorry [18:42:51] thanks [18:42:58] (03CR) 10Urbanecm: [C: 03+2] Wikibase: Use wikidataclient dblist directly for repo localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685852 (https://phabricator.wikimedia.org/T282160) (owner: 10Addshore) [18:43:08] ty [18:43:16] First one should be a wonderful noop :) [18:43:23] yup [18:43:29] confirmed by quick looking at the lists [18:43:46] Yup, leftover cleanup for aaaggess ago [18:43:46] (03Merged) 10jenkins-bot: Wikibase: Use wikidataclient dblist directly for repo localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685852 (https://phabricator.wikimedia.org/T282160) (owner: 10Addshore) [18:43:48] addshore: will you want to test it at a mwdebug? [18:44:12] it is there in case you want [18:44:24] Probably doesn't need it :) just if the site is still up ;) [18:44:33] aaaarrrrayy etc [18:44:51] Looks good to me [18:45:01] good, syncing [18:45:41] (03PS2) 10Urbanecm: Wikibase: Use wikidataclient-test dblist for testwikidata localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685853 (https://phabricator.wikimedia.org/T282160) (owner: 10Addshore) [18:45:45] (03CR) 10Urbanecm: [C: 03+2] Wikibase: Use wikidataclient-test dblist for testwikidata localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685853 (https://phabricator.wikimedia.org/T282160) (owner: 10Addshore) [18:46:01] we still use 1002 right? 
:) [18:46:40] !log urbanecm@deploy1002 Synchronized wmf-config/Wikibase.php: 7e21cf0d96541d0ab5cb18cd7741756ab1dfe7b8: NO-OP: Wikibase: Use wikidataclient dblist directly for repo localClientDatabases (T282160) (duration: 01m 04s) [18:46:40] (03Merged) 10jenkins-bot: Wikibase: Use wikidataclient-test dblist for testwikidata localClientDatabases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685853 (https://phabricator.wikimedia.org/T282160) (owner: 10Addshore) [18:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:49] T282160: Enable testcommonswiki as a client of testwikidatawiki with dispatching running - https://phabricator.wikimedia.org/T282160 [18:46:54] mwdebug1001...but i loaded few sites too, and it worked :D [18:47:14] addshore: second patch is on mwdebug1001, please test. [18:47:22] aaah 1001 ;) [18:47:35] looks good, site is up! [18:48:09] if there is any fallout from these it's more likely to come some time after (during the dispatch cron script that runs all the time) [18:48:09] (03PS1) 10Ottomata: Migrate WikidataCompletionSearchClicks to event platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685906 (https://phabricator.wikimedia.org/T282140) [18:48:21] (03PS2) 10Ottomata: Migrate WikidataCompletionSearchClicks to event platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685906 (https://phabricator.wikimedia.org/T282140) [18:48:22] hehe [18:48:58] (03PS1) 10Dzahn: site: update comment on mwdebug servers, remove mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/685907 [18:49:18] your comment made me do that one [18:49:32] :D [18:50:16] (03PS3) 10Ottomata: Migrate WikidataCompletionSearchClicks to event platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685906 (https://phabricator.wikimedia.org/T282140) [18:50:42] addshore: lmk how testcommonswiki works like :) [18:51:22] https://test-commons.wikimedia.org/ [18:51:40] you'll see 
the homepage is up to date ;) [18:51:52] yeah... [18:52:05] maybe we should delete it instead :D [18:52:21] Urbanecm: ok if I sync another patch? [18:52:26] ottomata: not now please [18:52:28] k [18:52:40] no hurry at all, lemme know when clear :) [18:52:49] will do [18:53:33] addshore: please lmk if ok to sync :) [18:53:54] lgtm! [18:54:06] thanks, syncing [18:55:43] !log urbanecm@deploy1002 Synchronized wmf-config/Wikibase.php: 338d1df5903cdc963b9eef22ec2c1750b7b3a02b: Wikibase: Use wikidataclient-test dblist for testwikidata localClientDatabases (T282160) (duration: 01m 05s) [18:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:52] T282160: Enable testcommonswiki as a client of testwikidatawiki with dispatching running - https://phabricator.wikimedia.org/T282160 [18:55:58] addshore: should be done [18:56:04] woo! ty! [18:56:06] ottomata: all clear for you. [18:56:11] ty [18:56:57] (03CR) 10Ottomata: [C: 03+2] Migrate WikidataCompletionSearchClicks to event platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685906 (https://phabricator.wikimedia.org/T282140) (owner: 10Ottomata) [18:58:41] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:685906|Migrate WikidataCompletionSearchClicks to event platform on all wikis (T282140)]] (duration: 01m 04s) [18:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:50] T282140: WikidataCompletionSearchClicks Event Platform Migration - https://phabricator.wikimedia.org/T282140 [18:59:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:00:04] brennen and liw: May I have your attention please! MediaWiki train - American+European Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T1900) [19:01:40] 10SRE, 10Commons, 10MediaWiki-API, 10Regression: Frequent 504s while using logevents API on Commons - https://phabricator.wikimedia.org/T282122 (10AntiCompositeNumber) p:05High→03Triage Please don't set priorities on tasks unless you plan to work on them. [19:02:38] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:03:03] (03PS1) 10Brennen Bearnes: all wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685910 [19:03:05] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685910 (owner: 10Brennen Bearnes) [19:03:39] (03PS1) 10Ottomata: refine - finalize WikidataCompletionSearchClicks migration to event platform [puppet] - 10https://gerrit.wikimedia.org/r/685911 (https://phabricator.wikimedia.org/T282140) [19:03:59] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685910 (owner: 10Brennen Bearnes) [19:05:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:23] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.4 [19:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:11:36] 10SRE, 10Commons, 10MediaWiki-API, 10Traffic, 10Regression: Frequent 504s while using logevents API on Commons - https://phabricator.wikimedia.org/T282122 (10Urbanecm) I can see the change in 504 trends at commonswiki 
([link](https://logstash.wikimedia.org/goto/798565e37db13c2dbe0944acad7ebb37)): {F3444... [19:14:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:16:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:20:19] 10SRE, 10ops-codfw, 10DBA, 10serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10BBlack) [19:33:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:33:42] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:51:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:54:30] (03PS1) 10Dzahn: thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) [19:54:40] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:56:02] (03CR) 10jerkins-bot: [V: 04-1] thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 
(https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [19:56:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:00:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:17] (03PS2) 10Dzahn: thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) [20:15:30] (03PS3) 10Jforrester: [wikitech] Enable VE desktop section edit links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) [20:15:36] (03CR) 10Jforrester: "Team says go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) (owner: 10Jforrester) [20:17:19] (03CR) 10Dzahn: "the problem now: "thumbor_memcached_servers" are in hieradata/role/eqiad/thumbor/mediawiki but not on mwmaint role.... so it can't find th" [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [20:21:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:21:14] (03CR) 10Dzahn: "Any suggestions on just the part "how to get the name of one random thumbor server"? 
Should the list of thumbor servers be moved to common" [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [20:24:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:26:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:35:02] Pchelolo: these keep trickling in: T282181 [20:35:03] T282181: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'actor_user' in 'on clause' - https://phabricator.wikimedia.org/T282181 [20:35:15] something subtly different from the thing patched yesterday? [20:36:53] (03CR) 10Cwhite: [C: 03+2] prometheus: Migrate node_gdnsd cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [20:37:03] (03PS3) 10Cwhite: prometheus: Migrate node_gdnsd cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [20:37:29] brennen: yeah, different stack trace [20:40:27] 10SRE, 10SRE-Access-Requests: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (10odimitrijevic) 05Open→03Resolved Thank you Daniel! Confirming that I now have access. We can close the ticket as resolved, and I will separately coordinate additi... [20:40:40] Pchelolo: my next question is: rollback-worthy? [20:40:51] depends on how often it happens [20:41:19] brennen: gimme a minute, the fix might be trivial [20:41:35] 45 odd since deploy to group2. 
[20:41:41] kk [20:45:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:46:07] brennen: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/685921 [20:46:24] my bad, fixed other two and didn't notice this one is broken too [20:48:37] (03CR) 10Cwhite: [C: 03+2] prometheus: Migrate node_file_count cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685583 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [20:48:43] (03PS3) 10Cwhite: prometheus: Migrate node_file_count cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685583 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [20:51:37] (03PS1) 10Razzi: kerberos: add reset-password action to manage_principals.py [puppet] - 10https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) [20:51:47] (03PS1) 10Cwhite: Revert "prometheus: Migrate node_file_count cron to systemd timer" [puppet] - 10https://gerrit.wikimedia.org/r/685889 [20:52:18] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Revert "prometheus: Migrate node_file_count cron to systemd timer" [puppet] - 10https://gerrit.wikimedia.org/r/685889 (owner: 10Cwhite) [20:52:34] (03PS1) 10Brennen Bearnes: Reorder tables in SpecialWatchlist [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685890 (https://phabricator.wikimedia.org/T282181) [20:53:35] (03CR) 10Ppchelko: [C: 03+1] Reorder tables in SpecialWatchlist [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685890 (https://phabricator.wikimedia.org/T282181) (owner: 10Brennen Bearnes) [20:53:49] Pchelolo: cool, thx. should we wait for review to sync backport, or pretty safe? 
[20:54:10] seems safe, but let's at least give jenkins a chance [20:54:16] yeah for sure [20:56:21] (03PS2) 10Razzi: kerberos: add reset-password action to manage_principals.py [puppet] - 10https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) [20:56:36] (03PS5) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [20:57:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:58:19] (03CR) 10Cwhite: [C: 03+2] prometheus: Migrate node_puppet_agent cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685581 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [20:58:24] (03PS2) 10Cwhite: prometheus: Migrate node_puppet_agent cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/685581 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [21:01:06] (03CR) 10Razzi: "Did a little refactoring here as well. The subject says password reset but the body is the same." 
[puppet] - 10https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: 10Razzi) [21:03:36] (03PS1) 10Mforns: Migrate VirtualPageView to EventPlatform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685928 (https://phabricator.wikimedia.org/T238138) [21:05:29] (03CR) 10Ahmon Dancy: Fixed a few minor typos in README.md (034 comments) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [21:06:07] (03CR) 10Ahmon Dancy: Fixed a few minor typos in README.md (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 (owner: 10Ahmon Dancy) [21:09:46] (03PS6) 10Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [21:10:37] bah gotta restart the Jenkins CI due to T281737 :-\ [21:10:38] T281737: Zuul can't stop jobs or set the build description - https://phabricator.wikimedia.org/T281737 [21:11:15] !log restarted CI Jenkins due to T281737 [21:11:17] :-\ [21:11:17] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10razzi) I vote to close this in favor of T166066 as @elukey suggested. Writing spec tests for modules doesn't scale as a solution, and we don't want to slow... [21:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:30] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:15:56] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:17:20] (PS7) Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - https://gerrit.wikimedia.org/r/679015 [21:18:20] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:20:58] (PS3) Razzi: kerberos: add reset-password action to manage_principals.py [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) [21:22:22] (CR) Razzi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29448/console" [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: Razzi) [21:25:42] (CR) Razzi: [V: +1 C: +2] "Hmm, I wouldn't expect a no-op from PCC: https://puppet-compiler.wmflabs.org/compiler1001/29448/" [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: Razzi) [21:30:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:32:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:32:54] Pchelolo: passed jenkins...
[21:34:08] brennen: I'm pretty sure it's safe to go in [21:34:15] cool, i'll go ahead with backport [21:34:20] famous last words :) [21:34:32] (CR) Brennen Bearnes: [C: +2] Reorder tables in SpecialWatchlist [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685890 (https://phabricator.wikimedia.org/T282181) (owner: Brennen Bearnes) [21:34:36] haha [21:35:46] (PS8) Ahmon Dancy: WIP: Test emailing notification of security patch failure [mediawiki-config] - https://gerrit.wikimedia.org/r/679015 [21:39:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:39:35] (PS1) Bstorm: toolforge kubernetes: change class for the new cinder environment [puppet] - https://gerrit.wikimedia.org/r/685936 (https://phabricator.wikimedia.org/T282087) [21:39:46] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:40:32] (CR) Bstorm: "Andrew, this should work on the existing nodes, right?" [puppet] - https://gerrit.wikimedia.org/r/685936 (https://phabricator.wikimedia.org/T282087) (owner: Bstorm) [21:41:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:42:29] SRE, ops-codfw, ops-eqiad, DC-Ops: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (wiki_willy) Hi @MoritzMuehlenhoff - we have all the data consolidated for you, so feel free to proceed. Thanks, Willy [21:42:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:44:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:46:28] !log uploaded new mailman3 and hyperkitty packages to apt.wm.o (T282092) [21:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:37] T282092: Upload new mailman3 and hyperkitty packages - https://phabricator.wikimedia.org/T282092 [21:48:00] !log upgraded mailman3 and hyperkitty on lists1002 (T282092) [21:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:55] SRE, Wikimedia-Mailing-lists, Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (Legoktm) Here's the patch I've hacked in for now: ` From: Kunal Mehta Date: Thu, 6 May 2021 14:25:41 -0700 Subject: Hack in a link to Postorius for a... [21:59:10] (CR) Bstorm: "I cherry-picked this into toolsbeta docker and it does this on a k8s worker:" [puppet] - https://gerrit.wikimedia.org/r/685936 (https://phabricator.wikimedia.org/T282087) (owner: Bstorm) [22:00:37] (Merged) jenkins-bot: Reorder tables in SpecialWatchlist [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685890 (https://phabricator.wikimedia.org/T282181) (owner: Brennen Bearnes) [22:03:14] jouncebot next [22:03:14] In 0 hour(s) and 56 minute(s): US Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T2300) [22:03:25] SRE, SRE-Access-Requests: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (Dzahn) Ok, thank you for confirming. You can also feel free to reopen this later. A different person might continue it another week.
[22:04:26] SRE, Wikimedia-Mailing-lists, Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (Legoktm) And because of {9b8147775d1e468a1b8578004aff645da9d153eb} the emails will now come from listadmins-owner@lists.wikimedia.org, which should be a bit better. I'll f... [22:04:48] Pchelolo: that's on mwdebug1002, not sure if it's really testable? [22:07:02] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/685936 (https://phabricator.wikimedia.org/T282087) (owner: Bstorm) [22:08:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:09:51] ah, whoops, i was getting confused by title having been stripped out of the reported URL in phatality. [22:10:02] definitely reproducible, fixed by patch. going ahead with synch. [22:10:04] -h [22:11:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:11:19] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.4/includes/specials/SpecialWatchlist.php: Backport: [[gerrit:685890|Reorder tables in SpecialWatchlist (T282181)]] (duration: 00m 57s) [22:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:27] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul) @BBlack i had meetings from 12:30 pm to 4PM so I didn't have the chance to work on the cp nodes. You can re-pool those since i will not be able to get back on those until th...
[22:11:28] T282181: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'actor_user' in 'on clause' - https://phabricator.wikimedia.org/T282181 [22:15:50] (PS1) Legoktm: Add Debian packaging [software/mailman-templates] - https://gerrit.wikimedia.org/r/685937 [22:16:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:17:35] (PS2) Dzahn: site: update comment on mwdebug servers, remove mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/685907 (https://phabricator.wikimedia.org/T267248) [22:18:26] (PS3) Dzahn: site: update comment on mwdebug servers, remove mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/685907 (https://phabricator.wikimedia.org/T267248) [22:19:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:19:18] (CR) Dzahn: [C: +2] site: update comment on mwdebug servers, remove mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/685907 (https://phabricator.wikimedia.org/T267248) (owner: Dzahn) [22:33:06] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:40:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:52:33] !log upgrading mailman3 and hyperkitty on lists1001 (T282092) [22:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:44] T282092: Upload new mailman3 and hyperkitty packages - https://phabricator.wikimedia.org/T282092 [22:53:37] (CR) Ladsgroup: lists: Add Apache
configuration for pipermail redirects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685711 (owner: Legoktm) [23:00:04] brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210506T2300). [23:02:13] (PS1) Legoktm: mailman3: Remove useless mailman_sync command [puppet] - https://gerrit.wikimedia.org/r/685946 [23:02:49] (PS2) Legoktm: mailman3: Don't call mailman_sync when migrating, it's useless [puppet] - https://gerrit.wikimedia.org/r/685946 [23:04:47] (CR) Ladsgroup: [C: +1] "You have my virtual blessing" [puppet] - https://gerrit.wikimedia.org/r/685946 (owner: Legoktm) [23:05:43] (PS1) Bstorm: wikireplicas: cut over the last IPs to the new cluster [puppet] - https://gerrit.wikimedia.org/r/685947 (https://phabricator.wikimedia.org/T260389) [23:07:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:07:42] (CR) Legoktm: [C: +2] mailman3: Don't call mailman_sync when migrating, it's useless [puppet] - https://gerrit.wikimedia.org/r/685946 (owner: Legoktm) [23:09:32] (PS2) Jdlrobson: Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies [mediawiki-config] - https://gerrit.wikimedia.org/r/685795 (https://phabricator.wikimedia.org/T281968) (owner: Jdrewniak) [23:09:39] (CR) Jdlrobson: [C: +1] Remove Vector language button from Commons, Wikidata, Mediawiki, Wikispecies [mediawiki-config] - https://gerrit.wikimedia.org/r/685795 (https://phabricator.wikimedia.org/T281968) (owner: Jdrewniak) [23:15:02] PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - /srv/docker/containers/4fd0e237ad99e976430bd90d1e6e5ab77754f92bd8cdf42f7b2c6f69497f0927/mounts/shm is not
accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [23:18:24] (PS1) Ladsgroup: lists: Rename langcom-l to langcom-internal [puppet] - https://gerrit.wikimedia.org/r/685948 [23:20:18] (CR) Legoktm: [C: +2] lists: Rename langcom-l to langcom-internal [puppet] - https://gerrit.wikimedia.org/r/685948 (owner: Ladsgroup) [23:24:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:28:08] (CR) Krinkle: [C: -2] "Must not merge until the cronjob has had a chance to run to completion at least once since https://gerrit.wikimedia.org/r/c/685222. This i" [mediawiki-config] - https://gerrit.wikimedia.org/r/685181 (https://phabricator.wikimedia.org/T280605) (owner: Krinkle) [23:32:46] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org [23:35:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:37:46] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org [23:41:34] (PS1) Ahmon Dancy: .pipeline/wmf-publish/build: Use --skip-message-purge [mediawiki-config] - https://gerrit.wikimedia.org/r/685952 [23:42:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:42:36] (Abandoned) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405 (owner: Ahmon Dancy) [23:42:47] (Traffic bill over
quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [23:44:12] (CR) BryanDavis: wikireplicas: cut over the last IPs to the new cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685947 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm) [23:50:21] SRE, Wikimedia-Mailing-lists: Upload new mailman3 and hyperkitty packages - https://phabricator.wikimedia.org/T282092 (Legoktm) Open→Resolved Upgraded both Cloud VPS instances too. [23:50:50] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Rollback group1 and group2 to 1.37.0-wmf.3 (T282193) [23:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:00] T282193: Query time out in ApiQueryLogEvents query - https://phabricator.wikimedia.org/T282193 [23:53:20] (PS1) Brennen Bearnes: Rollback group1 and group2 to 1.37.0-wmf.3 (T282193) [mediawiki-config] - https://gerrit.wikimedia.org/r/685954 [23:53:22] (CR) Brennen Bearnes: [C: +2] Rollback group1 and group2 to 1.37.0-wmf.3 (T282193) [mediawiki-config] - https://gerrit.wikimedia.org/r/685954 (owner: Brennen Bearnes) [23:54:35] (Merged) jenkins-bot: Rollback group1 and group2 to 1.37.0-wmf.3 (T282193) [mediawiki-config] - https://gerrit.wikimedia.org/r/685954 (owner: Brennen Bearnes)