[00:00:04] ACK, a little too early. I was thinking it's cache
[00:00:22] nope, config changes need to be manually synced out
[00:00:30] it just needs mwmaint
[00:00:37] !log urbanecm@deploy1002 Synchronized fc-list: 93970496da7678d896b7f812b3bb5f4cf0b691ad: update fc-list to current version on buster (T79424) (duration: 01m 09s)
[00:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:46] T79424: update svg font list - https://phabricator.wikimedia.org/T79424
[00:01:00] confirmed at https://noc.wikimedia.org/conf/highlight.php?file=fc-list
[00:01:10] and yea. the paths are in there now, but that changed in fc-list
[00:01:39] SRE, Patch-For-Review: update svg font list - https://phabricator.wikimedia.org/T79424 (Dzahn) https://noc.wikimedia.org/conf/highlight.php?file=fc-list
[00:01:46] yeah, in this case, only mwmaint is actually needed :). Anyway, should be done.
[00:01:57] thanks :)
[00:02:01] any time
[00:02:09] SRE, Patch-For-Review: update svg font list - https://phabricator.wikimedia.org/T79424 (Dzahn) Open→Resolved
[00:02:19] (CR) Urbanecm: [C: +2] Growth: enwiki: Add list of mentors [mediawiki-config] - https://gerrit.wikimedia.org/r/685143 (https://phabricator.wikimedia.org/T281896) (owner: Urbanecm)
[00:02:22] i'll just call that resolved, there are like 5 other tickets linked to that, heh
[00:02:25] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:42] in that case, good to have an RT-era ticket fixed :D
[00:03:07] I'm wondering if T1 should be fixed or just kept open forever as a memory
[00:03:09] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1
[00:03:22] SRE, Wikimedia-SVG-rendering, serviceops-radar, Patch-For-Review: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - 
https://phabricator.wikimedia.org/T280718 (Dzahn) An updated fc-list has been deployed: https://noc.wikimedia.org/conf/highlight.php?fil...
[00:04:36] (Merged) jenkins-bot: Growth: enwiki: Add list of mentors [mediawiki-config] - https://gerrit.wikimedia.org/r/685143 (https://phabricator.wikimedia.org/T281896) (owner: Urbanecm)
[00:04:38] Urbanecm: and now I'll use your people.wm httpbb tests and put them in the repo
[00:05:24] feel free to :). But maybe we should create something...more permanent in case the userdirs get changed?
[00:05:44] Urbanecm: that's the wrong bug 1 :(
[00:06:04] Urbanecm: cough, permanent.. i just fixed an assertion that annual.wm redirects to 2019
[00:06:13] :D
[00:06:17] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:06:41] https://static-bugzilla.wikimedia.org/bug1.html
[00:06:44] i mean i probably won't delete the folders soon, so if you're fine with it, go for it
[00:06:50] I even forgot the "offset" right now
[00:06:57] that bz ticket numbers have
[00:07:31] hehe
[00:07:32] Urbanecm: on a host like that it seems kind of impossible to rely on anything staying as it is
[00:07:45] well, I can put a test file in my own home of course
[00:08:13] putting it under people.wikimedia.org/tests/ might work
[00:08:36] SRE, ops-eqiad, cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (Andrew)
[00:08:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3f6ea8c0e5a4dc667969f5847207902727625bbe: Growth: enwiki: Add list of mentors (T281896) (duration: 01m 10s)
[00:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:46] T281896: Deploy Growth features on English Wikipedia - 
https://phabricator.wikimedia.org/T281896
[00:19:08] (CR) Dave Pifke: "Sorry I didn't get to this today. Overall it looks good; I'll look at it more thoroughly and add the extended descriptions tomorrow." [alerts] - https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: Filippo Giunchedi)
[00:22:37] PROBLEM - dump of es5 in eqiad on alert1001 is CRITICAL: dump for es5 at eqiad taken more than 8 days ago: Most recent backup 2021-04-27 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:22:47] assert_body_contains: "fenari"
[00:25:07] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2021-04-27 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:25:32] (PS1) Andrew Bogott: Trove policy: allow projectadmins to reset instance/cluster status [puppet] - https://gerrit.wikimedia.org/r/685148
[00:25:54] (PS1) Dzahn: httpbb: add tests and test_suite for people.wm.org [puppet] - https://gerrit.wikimedia.org/r/685149 (https://phabricator.wikimedia.org/T280989)
[00:26:39] (CR) Andrew Bogott: [C: +2] Trove policy: allow projectadmins to reset instance/cluster status [puppet] - https://gerrit.wikimedia.org/r/685148 (owner: Andrew Bogott)
[00:27:21] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:27:23] PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:35] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:27:59] PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:28:11] (CR) Dzahn: "Most tests are done by Urbanecm (thanks!). They actually test things that are supposed to be behind login. Which was really good to test p" [puppet] - https://gerrit.wikimedia.org/r/685149 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn)
[00:30:01] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:30:03] RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:03] RECOVERY - Check systemd state on elastic1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:15] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:41] RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:35:43] PROBLEM - dump of es4 in codfw on alert1001 is CRITICAL: dump for es4 at codfw taken more than 8 days ago: Most recent backup 2021-04-27 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:36:41] PROBLEM - dump of es5 in codfw on alert1001 is CRITICAL: 
dump for es5 at codfw taken more than 8 days ago: Most recent backup 2021-04-27 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:37:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:57] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:42:03] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:42:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:42:13] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:42:29] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:42:33] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:46:44] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) new httpbb tests show webserver working as expected, including URLs behind auth: ` [deploy1002:~] $ httpbb --hosts people[1002,1003].eqiad.wmnet /home/dzahn/test_people.yaml Sending to 2 hosts... PA... 
[00:51:13] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:55] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:52:02] (PS1) Andrew Bogott: Horizon local_settings: fix name of trove policy setting [puppet] - https://gerrit.wikimedia.org/r/685157 (https://phabricator.wikimedia.org/T281655)
[00:52:07] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:52:11] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:52:15] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:52:23] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:52:37] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:52:39] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:03:43] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[01:07:06] (CR) Andrew Bogott: [C: +2] Horizon local_settings: fix name of trove policy setting [puppet] - 
https://gerrit.wikimedia.org/r/685157 (https://phabricator.wikimedia.org/T281655) (owner: Andrew Bogott)
[01:08:33] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:43] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[01:09:13] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:16:03] (PS1) Andrew Bogott: cloud-vps: install python3-troveclient on Stretch VMs [puppet] - https://gerrit.wikimedia.org/r/685172
[01:16:58] (CR) Andrew Bogott: [C: +2] cloud-vps: install python3-troveclient on Stretch VMs [puppet] - https://gerrit.wikimedia.org/r/685172 (owner: Andrew Bogott)
[01:18:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:21:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:36:01] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[01:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:24] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[01:39:22] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:29] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1011.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`
[01:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:37] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[01:43:55] !log T280382 [WDQS] `racadm>>racadm serveraction powercycle` on `wdqs2007`
[01:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:31] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper) Resolved→Open
[01:44:42] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:16] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:45:22] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1011.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage`
[01:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:32] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[01:47:00] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (RKemper) @Papaul The host became ssh unreachable again. It looks like there might be something wrong with the underlying hardware. As...
[01:47:57] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[01:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:25] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:56] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` (will likely fail due to underlying hw but we'll see)
[01:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:13] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[01:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:38] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper) ` curl -H 'Content-Type: 
application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocatio...
[01:55:00] !log T281327 [Elastic] Unbanned `elastic2043` from cluster
[01:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:08] T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327
[01:56:24] ACKNOWLEDGEMENT - MD RAID on wdqs2007 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T281956 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:56:28] SRE, ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281956 (ops-monitoring-bot)
[02:06:09] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:35] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (Papaul) @RKemper Will try to upgrade the firmware on it tomorrow or Friday
[02:28:26] (PS1) Krinkle: Temporarily shorten $wgParserCacheExpireTime from 30 to 22 days [mediawiki-config] - https://gerrit.wikimedia.org/r/685181 (https://phabricator.wikimedia.org/T280605)
[02:29:19] (PS2) Krinkle: Temporarily shorten $wgParserCacheExpireTime from 30 to 22 days [mediawiki-config] - https://gerrit.wikimedia.org/r/685181 (https://phabricator.wikimedia.org/T280605)
[02:56:10] (PS1) Dzahn: admin: update email address for shell user Alangi Derick [puppet] - https://gerrit.wikimedia.org/r/685189 (https://phabricator.wikimedia.org/T281564)
[02:59:51] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin 
upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[03:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:02] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[03:01:28] (PS1) Dzahn: admin: upgrade derick from ldap_only to deployer [puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564)
[03:02:17] (Abandoned) Dzahn: admin: update email address for shell user Alangi Derick [puppet] - https://gerrit.wikimedia.org/r/685189 (https://phabricator.wikimedia.org/T281564) (owner: Dzahn)
[03:02:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:53] (PS2) Dzahn: admin: upgrade derick from ldap_only to deployer [puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564)
[03:05:07] PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 4428 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:05:19] RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:07:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:09] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:08:11] PROBLEM - WDQS high update lag on wdqs2001 is CRITICAL: 1.08e+05 ge 4.32e+04 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:09:11] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:09:21] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2007 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:09:23] RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:09:27] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:33] PROBLEM - puppet last run on wdqs2007 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:09:47] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service,prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:47] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:09:55] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2001 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:10:03] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 
200 OK - 448 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:11:31] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:16:57] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:07] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:18:05] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:18:05] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:21:57] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2007 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:01] PROBLEM - WDQS high update lag on wdqs2007 is CRITICAL: 1.077e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:22:05] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:37] RECOVERY - puppet last run on wdqs2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:30:53] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:31:36] (PS1) Andrew Bogott: Install python3-troveclient on VMs [puppet] - https://gerrit.wikimedia.org/r/685199
[03:32:21] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:00] (CR) Andrew Bogott: [C: +2] Install python3-troveclient on VMs [puppet] - https://gerrit.wikimedia.org/r/685199 (owner: Andrew Bogott)
[03:38:05] RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 1002 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:50:31] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:55] !log T280382 `wdqs2007.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. 
`df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv`
[03:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:04] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382
[03:51:39] !log T280382 [WDQS] `ryankemper@wdqs2007:~$ sudo depool` (need to monitor host to see if it becomes ssh unreachable again or if it was a one-off; also high update lag)
[03:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:48] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[03:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:56] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[03:54:11] !log T280382 `wdqs1011.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. 
`df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv`
[03:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:05] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[03:56:18] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
[03:56:25] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:54] !log T280563 Reboot of `eqiad` complete. Only ~half of `codfw` is remaining.
[03:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:00] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
[03:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:58:08] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563
[03:58:14] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts`
[03:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:43] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:15] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:13] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:27] SRE, ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281956 (Kizule)
[04:16:51] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (Kizule)
[04:20:29] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:10] (PS1) Legoktm: mailman3: Copy "info" field over manually [puppet] - https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933)
[04:27:46] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Mailman3 import script is unnecessarily truncating list descriptions - https://phabricator.wikimedia.org/T281933 (Legoktm) >>! In T281933#7060186, @Ladsgroup wrote: > One thing I was thinking was that we can simply set the description from the old file...
[04:30:03] (CR) Legoktm: "I only tested the pickle.loads() part of this to make sure that Python3 can actually open these Python2 pickle files properly." 
[puppet] - https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933) (owner: Legoktm)
[04:30:56] SRE, Mail, Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (Majavah) Maybe {T280744} is related?
[04:32:45] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:47] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:45] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:13] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:52:23] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:53] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:06] (PS1) Marostegui: db1178: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/685221 (https://phabricator.wikimedia.org/T275633)
[04:54:13] SRE, Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (BBlack) These are just 
about ready and running correct puppetization, but **don't** pool these yet. I think they may have some bad BIOS settings or something, at least related to power mgmt. cpufreq keep... [04:57:26] (03PS1) 10Marostegui: parsercachepurging.pp: Reduce parsercache retention to 21 days [puppet] - 10https://gerrit.wikimedia.org/r/685222 (https://phabricator.wikimedia.org/T280605) [05:01:53] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 54 (contint2001, ...), Fresh: 48 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:06:57] (03CR) 10Marostegui: [C: 03+2] db1178: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/685221 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [05:11:21] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Marostegui) @jcrespo can coordinate better the dbprov downtimes, I am swapping names there :) [05:11:35] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Marostegui) [05:11:50] (03PS1) 10Marostegui: instances.yaml: Add db1178 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/685228 (https://phabricator.wikimedia.org/T275633) [05:29:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P15722 and previous config saved to /var/cache/conftool/dbconfig/20210505-052943-marostegui.json [05:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 into s1 vslow, remove db1099:3311', diff saved to https://phabricator.wikimedia.org/P15723 and previous config saved to /var/cache/conftool/dbconfig/20210505-053211-marostegui.json [05:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:36] I just started seeing a banner on wikidata about T281212 but it's saying 
we may not be able to save edits - that isn't listed on T281375 as an issue. Are any wikis going to be set read-only? Or just x1 and we can still edit, just not some other stuff [05:35:37] T281212: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 [05:35:37] T281375: Read only time for extension 1 (x1) primary database on 2021-05-05 - https://phabricator.wikimedia.org/T281375 [05:35:54] DannyS712: only x1 [05:37:00] okay, can the banner be clarified? It's being shown on all wikis talking about not being able to save edits [05:37:30] DannyS712: you might want to comment on https://phabricator.wikimedia.org/T281375 about it - but I am afraid it is probably too late to fix it [05:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311 into main traffic', diff saved to https://phabricator.wikimedia.org/P15724 and previous config saved to /var/cache/conftool/dbconfig/20210505-053841-marostegui.json [05:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:08] left a comment, but yeah, the only time people see it (before and during the window) it's too late to fix :( [05:39:21] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1178 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/685228 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [05:39:53] (03PS1) 10Marostegui: Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/685053 [05:44:10] (03PS1) 10Brennen Bearnes: logspam-watch: correctly handle 0 for total error counts [puppet] - 10https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) [05:44:45] (03PS2) 10Brennen Bearnes: logspam-watch: correctly handle 0 for total error counts [puppet] - 10https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) [05:46:27] (03CR) 10Marostegui: [C: 03+2] Revert "db1106: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/685053 
(owner: 10Marostegui) [05:55:33] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:59] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:13] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:33] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:45] RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:03] !log Restart mysqld on x1 database primary master (db1103) T281212 [06:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:12] T281212: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T281212 [06:00:57] all done [06:03:08] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (10Marostegui) [06:06:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1104 from API', diff saved to https://phabricator.wikimedia.org/P15725 and previous config saved to /var/cache/conftool/dbconfig/20210505-060636-marostegui.json [06:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:14] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Add db1178 into dbctl T275633', diff saved to https://phabricator.wikimedia.org/P15726 and previous config saved to /var/cache/conftool/dbconfig/20210505-060814-marostegui.json [06:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:22] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 1%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15727 and previous config saved to /var/cache/conftool/dbconfig/20210505-060912-root.json [06:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:03] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:11:13] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:11:15] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:09] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:47] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:13] 10SRE, 
10Dumps-Generation, 10Wikidata, 10observability, 10wdwb-tech: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10ArielGlenn) What are the next steps on this? Should I be tweaking a manifest someplace? [06:24:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 2%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15728 and previous config saved to /var/cache/conftool/dbconfig/20210505-062416-root.json [06:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:25] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:26:20] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:24] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:58] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:36] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 3%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15729 and previous config saved to /var/cache/conftool/dbconfig/20210505-063920-root.json [06:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:39:29] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:40:58] PROBLEM - Check systemd state on elastic2056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:59] !log Check tables on db1112 (lag might show up on s3 on wiki replicas) T280492 [06:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:08] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 [06:42:04] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 and db1156 to switch sanitarium hosts T280492', diff saved to https://phabricator.wikimedia.org/P15730 and previous config saved to /var/cache/conftool/dbconfig/20210505-064204-marostegui.json [06:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:52] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:20] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:36] PROBLEM - Check systemd state on elastic2040 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:38] PROBLEM - Check systemd 
state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15731 and previous config saved to /var/cache/conftool/dbconfig/20210505-064712-root.json [06:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:40] RECOVERY - Check systemd state on elastic2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15732 and previous config saved to /var/cache/conftool/dbconfig/20210505-064905-root.json [06:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P15733 and previous config saved to /var/cache/conftool/dbconfig/20210505-065142-root.json [06:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 5%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15734 and previous config saved to /var/cache/conftool/dbconfig/20210505-065423-root.json [06:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:32] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:54:48] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission 
db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:55:18] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:57:58] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:00] RECOVERY - Check systemd state on elastic2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:04] (03PS1) 10Elukey: bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) [06:58:43] (03CR) 10jerkins-bot: [V: 04-1] bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [07:00:09] uff [07:01:24] (03PS1) 10Marostegui: site.pp: Remove comments from db1154,db1155 [puppet] - 10https://gerrit.wikimedia.org/r/685315 [07:01:28] (03PS2) 10Elukey: bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) [07:02:02] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove comments from db1154,db1155 [puppet] - 10https://gerrit.wikimedia.org/r/685315 (owner: 10Marostegui) [07:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15735 and previous config saved to /var/cache/conftool/dbconfig/20210505-070216-root.json [07:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repool db1156', diff saved to 
https://phabricator.wikimedia.org/P15736 and previous config saved to /var/cache/conftool/dbconfig/20210505-070409-root.json [07:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:03] (03PS1) 10Marostegui: install_server: Do not reimage db1118, db1125 [puppet] - 10https://gerrit.wikimedia.org/r/685316 [07:05:49] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1118, db1125 [puppet] - 10https://gerrit.wikimedia.org/r/685316 (owner: 10Marostegui) [07:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P15737 and previous config saved to /var/cache/conftool/dbconfig/20210505-070646-root.json [07:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:46] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:09:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15738 and previous config saved to /var/cache/conftool/dbconfig/20210505-070927-root.json [07:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:36] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:11:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 T281794', diff saved to https://phabricator.wikimedia.org/P15739 and previous config saved to /var/cache/conftool/dbconfig/20210505-071132-marostegui.json [07:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:43] T281794: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 [07:12:33] (03PS1) 10Marostegui: db1082: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/685317 (https://phabricator.wikimedia.org/T281794) [07:13:15] (03CR) 10Marostegui: [C: 03+2] db1082: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/685317 (https://phabricator.wikimedia.org/T281794) (owner: 10Marostegui) [07:14:30] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15740 and previous config saved to /var/cache/conftool/dbconfig/20210505-071720-root.json [07:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15741 and previous config saved to /var/cache/conftool/dbconfig/20210505-071912-root.json [07:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:59] (03PS3) 10Elukey: bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) [07:20:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29394/console" [puppet] - 10https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: 10Elukey) [07:21:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P15742 and previous config saved to /var/cache/conftool/dbconfig/20210505-072149-root.json [07:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 
15%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15743 and previous config saved to /var/cache/conftool/dbconfig/20210505-072431-root.json [07:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:40] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:28:53] (03PS2) 10Muehlenhoff: Make cumin2002 a Cumin host [puppet] - 10https://gerrit.wikimedia.org/r/681404 (https://phabricator.wikimedia.org/T276589) [07:32:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15744 and previous config saved to /var/cache/conftool/dbconfig/20210505-073223-root.json [07:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:10] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 64, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:34:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repool db1156', diff saved to https://phabricator.wikimedia.org/P15745 and previous config saved to /var/cache/conftool/dbconfig/20210505-073416-root.json [07:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:30] !log rolling restart of cassandra in eqiad to pick up Java security updates [07:35:35] !log jmm@cumin2001 START - Cookbook sre.cassandra.roll-restart [07:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Repool db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P15746 and previous config saved to /var/cache/conftool/dbconfig/20210505-073653-root.json [07:37:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P15747 and previous config saved to /var/cache/conftool/dbconfig/20210505-073722-marostegui.json [07:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 20%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15748 and previous config saved to /var/cache/conftool/dbconfig/20210505-073934-root.json [07:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:44] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:41:23] (03CR) 10Muehlenhoff: [C: 03+2] Make cumin2002 a Cumin host [puppet] - 10https://gerrit.wikimedia.org/r/681404 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [07:45:50] (03PS1) 10Muehlenhoff: cumin: Add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/685323 [07:48:02] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685323 (owner: 10Muehlenhoff) [07:50:19] (03CR) 10Muehlenhoff: [C: 03+2] cumin: Add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/685323 (owner: 10Muehlenhoff) [07:53:19] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [07:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:27] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [07:54:04] (03PS1) 10Muehlenhoff: wmf_root_client: Also allow cumin2002 [puppet] - 
10https://gerrit.wikimedia.org/r/685331 [07:54:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15749 and previous config saved to /var/cache/conftool/dbconfig/20210505-075438-root.json [07:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:47] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:01:53] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [08:02:56] (03CR) 10Muehlenhoff: [C: 03+2] wmf_root_client: Also allow cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/685331 (owner: 10Muehlenhoff) [08:03:56] (03CR) 10Filippo Giunchedi: "When using the decom cookbook I'd remove all references to the hosts from puppet and run the cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [08:06:12] Hi, we would like to do a production deployment for mobileapps outside the deployment window to land a fix for a UBN ticket: https://phabricator.wikimedia.org/T281938 [08:07:00] Is that ok? [08:09:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 30%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15750 and previous config saved to /var/cache/conftool/dbconfig/20210505-080942-root.json [08:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:55] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:10:17] nemo-yiannis: Assuming that's deploying the kubernetes service "mobileapps" I'd say it's fine. If possible, verify on staging first ofc. 
[08:13:53] !log uploaded spicerack_0.0.51 to apt.wikimedia.org buster-wikimedia [08:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:10] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:38] one backup transfer failed AFAICT [08:22:58] jayme: yeah that's kubernetes, sounds good [08:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 35%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15751 and previous config saved to /var/cache/conftool/dbconfig/20210505-082446-root.json [08:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:58] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:25:14] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar, 10Patch-For-Review: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (104nn1l2) @Dzahn the list is not complete yet. For example, FreeMono and FreeSerif are not in y... 
[08:26:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P15752 and previous config saved to /var/cache/conftool/dbconfig/20210505-082609-marostegui.json [08:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:29] (03PS1) 10Jgiannelos: Deploy mobileapps 2021-05-05-080505-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/685346 [08:36:35] (03PS1) 10Muehlenhoff: Also select wmf-mariadb104-client for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/685347 [08:37:27] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) @Papaul dbprov2002 should be shut down carefully to make sure data is kept intact (I'd prefer to do so). Otherwise, it can be down for e.g. 1 day.Will it need IP changes done beforehand?... [08:37:30] 10SRE, 10ops-codfw: Move YubiHSM from auth2001 to pki2001 - https://phabricator.wikimedia.org/T281459 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:37:34] (03CR) 10Jgiannelos: [C: 03+2] Deploy mobileapps 2021-05-05-080505-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/685346 (owner: 10Jgiannelos) [08:37:38] 10SRE, 10CAS-SSO: Tomcat/CAS fails to start with OpenJDK 11.0.11 - https://phabricator.wikimedia.org/T281345 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 for schema change', diff saved to https://phabricator.wikimedia.org/P15753 and previous config saved to /var/cache/conftool/dbconfig/20210505-083810-marostegui.json [08:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:27] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10jcrespo) [08:38:58] (03Merged) 10jenkins-bot: Deploy mobileapps 2021-05-05-080505-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/685346 (owner: 10Jgiannelos) [08:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15754 and previous config saved to /var/cache/conftool/dbconfig/20210505-083950-root.json [08:39:52] (03CR) 10David Caro: [C: 03+2] "Added and removed an etcd node on toolsbeta to test it without problems." [cookbooks] - 10https://gerrit.wikimedia.org/r/684964 (https://phabricator.wikimedia.org/T281508) (owner: 10David Caro) [08:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:01] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:41:01] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [08:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:55] (03Merged) 10jenkins-bot: wmcs: use yaml vs json for k8s objects [cookbooks] - 10https://gerrit.wikimedia.org/r/684964 (https://phabricator.wikimedia.org/T281508) (owner: 10David Caro) [08:47:14] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[08:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:30] (03CR) 10Muehlenhoff: [C: 03+2] Also select wmf-mariadb104-client for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/685347 (owner: 10Muehlenhoff) [08:48:24] (03PS1) 10WMDE-Fisch: Enable ReferencePreviews on first wikis InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) [08:48:26] (03PS1) 10WMDE-Fisch: Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) [08:50:44] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [08:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:10] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar, 10Patch-For-Review: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10JoKalliauer) @4nn1l2 : `FreeMono` fallback to `DejaVu Sans Mono` `FreeSerif` falback to `Dej... 
[08:52:50] (03CR) 10Kormat: [C: 03+1] parsercachepurging.pp: Reduce parsercache retention to 21 days [puppet] - 10https://gerrit.wikimedia.org/r/685222 (https://phabricator.wikimedia.org/T280605) (owner: 10Marostegui) [08:54:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 60%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15755 and previous config saved to /var/cache/conftool/dbconfig/20210505-085454-root.json [08:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:03] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:55:43] !log Restarting CI Jenkins # T281737 [08:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:51] T281737: Zuul can't stop jobs or set the build description - https://phabricator.wikimedia.org/T281737 [08:57:48] (03PS1) 10Muehlenhoff: profile::dbbackups::transfer: Also set internal/description for standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/685358 [09:02:03] 10SRE, 10ops-codfw, 10DBA: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10elukey) [09:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repool db1168', diff saved to https://phabricator.wikimedia.org/P15756 and previous config saved to /var/cache/conftool/dbconfig/20210505-090434-root.json [09:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:32] !log Upgraded Jenkins ldap plugin from 1.26 to 2.6 # T281737 [09:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:40] T281737: Zuul can't stop jobs or set the build description - https://phabricator.wikimedia.org/T281737 [09:09:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 70%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15757 and 
previous config saved to /var/cache/conftool/dbconfig/20210505-090957-root.json
[09:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:07] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[09:12:54] (PS2) Hashar: ci: add docker0 IP to /etc/hosts entry [puppet] - https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737)
[09:13:14] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737) (owner: Hashar)
[09:13:19] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (jijiki)
[09:15:48] (CR) Jcrespo: [C: +1] profile::dbbackups::transfer: Also set internal/description for standby hosts [puppet] - https://gerrit.wikimedia.org/r/685358 (owner: Muehlenhoff)
[09:16:22] (CR) Jcrespo: [C: +1] "Sorry, we didn't test a lot the removal case. Please merge asap."
[puppet] - https://gerrit.wikimedia.org/r/685358 (owner: Muehlenhoff)
[09:19:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repool db1168', diff saved to https://phabricator.wikimedia.org/P15758 and previous config saved to /var/cache/conftool/dbconfig/20210505-091938-root.json
[09:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:23] (PS2) Arturo Borrero Gonzalez: openstack: wmcs-dns-floating-ip-updater: fix file permission [puppet] - https://gerrit.wikimedia.org/r/684989
[09:20:27] (CR) Muehlenhoff: [C: +2] profile::dbbackups::transfer: Also set internal/description for standby hosts [puppet] - https://gerrit.wikimedia.org/r/685358 (owner: Muehlenhoff)
[09:21:46] (PS4) Elukey: bigtop::hadoop::nodemanager: apply systemd override to service [puppet] - https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792)
[09:22:46] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29395/console" [puppet] - https://gerrit.wikimedia.org/r/685314 (https://phabricator.wikimedia.org/T281792) (owner: Elukey)
[09:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 80%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15759 and previous config saved to /var/cache/conftool/dbconfig/20210505-092501-root.json
[09:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:11] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[09:26:29] (PS3) Arturo Borrero Gonzalez: openstack: wmcs-dns-floating-ip-updater: fix file permission [puppet] - https://gerrit.wikimedia.org/r/684989
[09:28:58] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/684989 (owner: Arturo Borrero Gonzalez)
[09:31:17] (CR) Thiemo Kreuz (WMDE):
Enable ReferencePreviews on first wikis CommonSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[09:32:25] (CR) Hashar: "Compiler https://puppet-compiler.wmflabs.org/compiler1001/730/" [puppet] - https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737) (owner: Hashar)
[09:34:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repool db1168', diff saved to https://phabricator.wikimedia.org/P15760 and previous config saved to /var/cache/conftool/dbconfig/20210505-093441-root.json
[09:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:39] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: wmcs-dns-floating-ip-updater: fix file permission [puppet] - https://gerrit.wikimedia.org/r/684989 (owner: Arturo Borrero Gonzalez)
[09:39:17] (PS1) Muehlenhoff: homer: Only use scap up to Buster, deployment will switch to a cookbook [puppet] - https://gerrit.wikimedia.org/r/685374
[09:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Slowly pool db1178 into s8 T275633', diff saved to https://phabricator.wikimedia.org/P15761 and previous config saved to /var/cache/conftool/dbconfig/20210505-094005-root.json
[09:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:14] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[09:40:18] (CR) Volans: [C: +1] "if PPC doesn't complain LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/685374 (owner: Muehlenhoff)
[09:42:49] (CR) Thiemo Kreuz (WMDE): Enable ReferencePreviews on first wikis InitialiseSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[09:47:07] (PS2) David Caro: wmcs.drain_hypervisor: use canary project instead of VM name [puppet] - https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641)
[09:48:29] (PS2) Muehlenhoff: homer: Only use scap up to Buster, deployment will switch to a cookbook [puppet] - https://gerrit.wikimedia.org/r/685374
[09:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repool db1168', diff saved to https://phabricator.wikimedia.org/P15762 and previous config saved to /var/cache/conftool/dbconfig/20210505-094945-root.json
[09:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:14] !log Restarted Zuul / CI
[09:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:28] (CR) Volans: homer: Only use scap up to Buster, deployment will switch to a cookbook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685374 (owner: Muehlenhoff)
[09:56:25] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (jcrespo) Latest status: ` root@backup1001:~$ check_bacula.py people1003.eqiad.wmnet-Monthly-1st-Sun-production-home None: type: I, status: C, bytes: 0 2021-04-27 04:54:36: type: F, status: f, bytes: 0 2021-0...
[09:57:56] (PS1) Jcrespo: Revert "bacula: add people1003 job to monitoring ignorelist" [puppet] - https://gerrit.wikimedia.org/r/685056
[10:00:06] (PS1) Arturo Borrero Gonzalez: cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704)
[10:00:13] (PS3) Muehlenhoff: homer: Only use scap up to Buster, deployment will switch to a cookbook [puppet] - https://gerrit.wikimedia.org/r/685374
[10:02:00] (CR) jerkins-bot: [V: -1] homer: Only use scap up to Buster, deployment will switch to a cookbook [puppet] - https://gerrit.wikimedia.org/r/685374 (owner: Muehlenhoff)
[10:02:03] (CR) jerkins-bot: [V: -1] cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[10:02:08] (CR) Jcrespo: "I am self merging, as I believe Daniel will want this after T280989#7060999" [puppet] - https://gerrit.wikimedia.org/r/685056 (owner: Jcrespo)
[10:02:29] (CR) Jcrespo: [C: +2] Revert "bacula: add people1003 job to monitoring ignorelist" [puppet] - https://gerrit.wikimedia.org/r/685056 (owner: Jcrespo)
[10:03:19] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29397/" [puppet] - https://gerrit.wikimedia.org/r/685374 (owner: Muehlenhoff)
[10:03:30] (PS2) Arturo Borrero Gonzalez: cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704)
[10:04:43] (PS4) Muehlenhoff: homer: Only use scap up to Buster, deployment will switch to a cookbook [puppet] - https://gerrit.wikimedia.org/r/685374
[10:06:05] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.drain_hypervisor: use canary project instead of VM name [puppet] - https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[10:06:43] (CR) WMDE-Fisch: Enable
ReferencePreviews on first wikis InitialiseSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[10:07:34] (CR) Volans: [C: +1] "LGTM as a temporary fix for bullseye" [puppet] - https://gerrit.wikimedia.org/r/685374 (owner: Muehlenhoff)
[10:07:46] (PS2) WMDE-Fisch: Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206)
[10:09:52] (CR) WMDE-Fisch: Enable ReferencePreviews on first wikis CommonSettings (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[10:11:06] (CR) Muehlenhoff: [C: +2] homer: Only use scap up to Buster, deployment will switch to a cookbook [puppet] - https://gerrit.wikimedia.org/r/685374 (owner: Muehlenhoff)
[10:12:21] (CR) WMDE-Fisch: Enable ReferencePreviews on first wikis InitialiseSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[10:12:21] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:15:21] (CR) David Caro: [C: +2] wmcs.drain_hypervisor: use canary project instead of VM name [puppet] - https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[10:15:44] (CR) David Caro: [C: +2] wmcs.drain_hypervisor: use canary project instead of VM name (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[10:17:09] (PS3) Arturo Borrero Gonzalez: cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704)
[10:17:46] (CR) Jbond: [C: +2] P:pki::multirootca: Add timer to clean expired certificates [puppet] - https://gerrit.wikimedia.org/r/685026 (https://phabricator.wikimedia.org/T281369) (owner: Jbond)
[10:18:45] (CR) jerkins-bot: [V: -1] cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[10:19:08] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[10:19:33] dcaro: happy for me to merge yours
[10:19:59] jbond42: yes please
[10:20:11] merged
[10:20:15] thanks!
[10:20:25] np
[10:22:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[10:23:46] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org
[10:24:23] mmhh not sure why the device/hostname isn't in there ^ anyways
[10:26:54] PROBLEM - LibreNMS has a critical alert #page on alert1001 is CRITICAL: Primary inbound port utilisation over 80% #page (cr2-eqiad.wikimedia.org) // Primary outbound port utilisation over 80% #page (asw2-d-eqiad.mgmt.eqiad.wmnet) https://bit.ly/wmf-librenms
[10:27:29] looking
[10:27:43] XioNoX: see _security too
[10:27:47] meanwhile I will check impact
[10:27:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org
[10:28:46] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation
over 80% #page - https://alerts.wikimedia.org
[10:29:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0)
[10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:44] impact should be very minimal to null
[10:30:47] and I can confirm that looking at several graphs and eqiad responsiveness
[10:34:33] it means that 1/4th of the row A flows will suffer congestion and retransmits
[10:35:04] 1/4th down to 1/8th depending on where the VRRP gateway is
[10:35:37] er, row D
[10:37:41] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:38:55] RECOVERY - LibreNMS has a critical alert #page on alert1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms
[10:40:02] (PS1) Jbond: cloud - hiera: add mfa to idp [puppet] - https://gerrit.wikimedia.org/r/685396
[10:40:38] (PS1) Jcrespo: dbbackups: Reenable notifications on db1102 and db2101 after setup [puppet] - https://gerrit.wikimedia.org/r/685397 (https://phabricator.wikimedia.org/T280979)
[10:40:47] (PS2) Jcrespo: dbbackups: Reenable notifications on db1102 and db2101 after setup [puppet] - https://gerrit.wikimedia.org/r/685397 (https://phabricator.wikimedia.org/T280979)
[10:40:54] (CR) jerkins-bot: [V: -1] dbbackups: Reenable notifications on db1102 and db2101 after setup [puppet] - https://gerrit.wikimedia.org/r/685397 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo)
[10:42:13] (CR) Jbond: [C: +2] cloud - hiera: add mfa to idp [puppet] - https://gerrit.wikimedia.org/r/685396 (owner: Jbond)
[10:42:33] (CR) Jcrespo: "FYI With these new instances (on a 3-per-host configuration, at least for now) we now have coverage of buster of all sections except s7 an" [puppet] - https://gerrit.wikimedia.org/r/685397 (https://phabricator.wikimedia.org/T280979) (owner:
Jcrespo)
[10:43:56] XioNoX, I can see some potential impact in the worst-case scenario: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=31&orgId=1&from=1620207808676&to=1620211408676&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[10:44:13] but not on the average
[10:48:49] (CR) Jcrespo: [C: -1] "Waiting for db2101:s2 to catch up after table checks: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=6&orgId=1&var-s" [puppet] - https://gerrit.wikimedia.org/r/685397 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo)
[10:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173 for schema change', diff saved to https://phabricator.wikimedia.org/P15763 and previous config saved to /var/cache/conftool/dbconfig/20210505-105015-marostegui.json
[10:53:56] (CR) Svantje Lilienthal: [C: +1] Enable ReferencePreviews on first wikis InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[10:58:52] (CR) Svantje Lilienthal: [C: +1] Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1100).
[11:00:04] jan_drewniak and CFisch_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:14] i can deploy today
[11:00:17] o/
[11:00:33] o/
[11:00:44] Would be happy if you could do mine again Urbanecm :-D
[11:00:50] any time :)
[11:01:01] (PS4) Urbanecm: Enable new language button for all logged in users outside test projects [mediawiki-config] - https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: Jdlrobson)
[11:01:28] (CR) Urbanecm: [C: +2] Enable new language button for all logged in users outside test projects [mediawiki-config] - https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: Jdlrobson)
[11:02:58] (Merged) jenkins-bot: Enable new language button for all logged in users outside test projects [mediawiki-config] - https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: Jdlrobson)
[11:03:36] jan_drewniak: pulled onto mwdebug1001, can you test?
[11:04:11] Urbanecm: ok, we're testing it
[11:04:16] thanks
[11:06:23] (PS1) Arturo Borrero Gonzalez: cloudgw: enable notifications [puppet] - https://gerrit.wikimedia.org/r/685405 (https://phabricator.wikimedia.org/T270704)
[11:07:03] (PS1) Jbond: hiera - cloud: add cloud servername [puppet] - https://gerrit.wikimedia.org/r/685406
[11:07:46] Urbanecm: ok, we're good to sync!
[11:07:53] thanks a lot, syncing it out
[11:08:44] (CR) Urbanecm: [C: +2] Enable ReferencePreviews on first wikis InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:08:48] (PS2) Urbanecm: Enable ReferencePreviews on first wikis InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:08:54] (CR) Urbanecm: [C: +2] Enable ReferencePreviews on first wikis InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:09:11] (PS3) Urbanecm: Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:09:17] (CR) Urbanecm: [C: +2] Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:09:19] (CR) Jbond: [C: +2] hiera - cloud: add cloud servername [puppet] - https://gerrit.wikimedia.org/r/685406 (owner: Jbond)
[11:09:41] (Merged) jenkins-bot: Enable ReferencePreviews on first wikis InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685349 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:09:58] (Merged) jenkins-bot: Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/685350 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:11:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 289dc34feeb0703bb45f4a71c149cd607ef26455: Enable new language button for all logged in users outside test projects (T280526) (duration: 02m 24s)
[11:11:03] jan_drewniak: should be live!
[11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:09] T280526: Deploy new language switching functionality to logged-in users - https://phabricator.wikimedia.org/T280526
[11:11:22] CFisch_WMDE: your commits are on mwdebug1001, please test
[11:11:29] (i pulled both of them there, as they're closely related)
[11:11:32] Urbanecm: I'll do
[11:12:14] thanks
[11:12:28] Urbanecm: thanks!
[11:12:32] any time
[11:13:04] (PS4) Arturo Borrero Gonzalez: cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704)
[11:16:32] There seems to be a small issue but we'll figure that out differently. Nothing urgent. Please go on Urbanecm .
[11:16:39] okay, syncing
[11:17:11] !log urbanecm@deploy1002 Scap failed!: Call to mwscript eval.php stderr: not empty
[11:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:22] what
[11:17:45] ohh
[11:17:48] my mistake
[11:17:56] CFisch_WMDE: we need a default to the variable
[11:18:05] otherwise the variable won't be defined
[11:18:23] hmmm?
[11:18:54] CFisch_WMDE: i mean this https://www.irccloud.com/pastebin/14r6Xlob/
[11:19:03] the variable needs to have a 'default' => xxx entry
[11:19:05] Ahhh
[11:19:07] damn
[11:19:10] it looks it should have default => false?
[11:19:14] yes
[11:19:20] CFisch_WMDE: can you upload a patch please?
[11:19:25] sure
[11:19:28] thanks
[11:21:01] (PS1) WMDE-Fisch: Add default to ReferencePreviews wmg var [mediawiki-config] - https://gerrit.wikimedia.org/r/685409 (https://phabricator.wikimedia.org/T271206)
[11:21:11] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/685409
[11:21:18] (CR) Urbanecm: [C: +2] Add default to ReferencePreviews wmg var [mediawiki-config] - https://gerrit.wikimedia.org/r/685409 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:21:18] Urbanecm: ^
[11:21:21] :-)
[11:21:28] thanks.
Sorry, should've noticed it before pressing +2 :/
[11:21:41] Yeah, me too I guess. No worries.
[11:21:58] But good, that it's nothing else that's broken ^^' (hopefully)
[11:22:03] (Merged) jenkins-bot: Add default to ReferencePreviews wmg var [mediawiki-config] - https://gerrit.wikimedia.org/r/685409 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[11:22:08] yep, hopefully
[11:22:12] at least scap has checks :)
[11:22:19] Hehe
[11:22:45] okay, no notices now. Syncing it out then :)
[11:24:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4f3051bf286b89e47ef153532de76756f2e7ade9: Enable ReferencePreviews on first wikis (T271206; 1/2) (duration: 01m 20s)
[11:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:21] T271206: Enable RefPreviews on first wikis - https://phabricator.wikimedia.org/T271206
[11:25:53] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 3565427dcd80e78352c99eb322de3318ae89a4ee: Enable ReferencePreviews on first wikis (T271206; 2/2) (duration: 01m 10s)
[11:25:59] CFisch_WMDE: that should be it :)
[11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:02] anything else?
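Note on the scap failure at 11:17: per the exchange above, the sync aborted because the new setting had no 'default' entry, so the variable was left undefined for any wiki not listed explicitly. A minimal Python sketch of that lookup behaviour follows. It is an illustration under stated assumptions, not the actual mediawiki-config code (which is PHP): the setting name `wmgExampleFeature`, the wiki names, and the `resolve_setting` helper are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical, simplified stand-in for a per-wiki settings table.
# The real configuration is PHP; every name below is illustrative only.
BROKEN: Dict[str, Dict[str, Any]] = {
    "wmgExampleFeature": {"dewiki": True},  # no 'default' entry
}
FIXED: Dict[str, Dict[str, Any]] = {
    "wmgExampleFeature": {"default": False, "dewiki": True},
}


def resolve_setting(settings: Dict[str, Dict[str, Any]], name: str, wiki: str) -> Any:
    """Return the value of `name` for `wiki`, falling back to 'default'.

    Raises KeyError when the wiki is not listed and no 'default' exists,
    which mirrors the "variable won't be defined" situation in the log.
    """
    per_wiki = settings[name]
    if wiki in per_wiki:
        return per_wiki[wiki]
    if "default" in per_wiki:
        return per_wiki["default"]
    raise KeyError(f"{name} has no value for {wiki} and no 'default' entry")


if __name__ == "__main__":
    # A wiki listed explicitly gets its own value.
    print(resolve_setting(FIXED, "wmgExampleFeature", "dewiki"))
    # Any other wiki falls back to the default.
    print(resolve_setting(FIXED, "wmgExampleFeature", "enwiki"))
    # Without a default, the lookup fails for unlisted wikis.
    try:
        resolve_setting(BROKEN, "wmgExampleFeature", "enwiki")
    except KeyError as err:
        print("undefined:", err)
```

In this sketch the fix corresponds to adding `"default": False` to the table, matching the `default => false` suggested in the conversation.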
[11:27:29] (PS3) Urbanecm: Enable Wikidata description override on dewiki at beta [mediawiki-config] - https://gerrit.wikimedia.org/r/682337 (https://phabricator.wikimedia.org/T279829) (owner: Luke081515)
[11:27:36] (CR) Urbanecm: [C: +2] "no-op for prod" [mediawiki-config] - https://gerrit.wikimedia.org/r/682337 (https://phabricator.wikimedia.org/T279829) (owner: Luke081515)
[11:28:21] (Merged) jenkins-bot: Enable Wikidata description override on dewiki at beta [mediawiki-config] - https://gerrit.wikimedia.org/r/682337 (https://phabricator.wikimedia.org/T279829) (owner: Luke081515)
[11:28:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repool db1173', diff saved to https://phabricator.wikimedia.org/P15764 and previous config saved to /var/cache/conftool/dbconfig/20210505-112842-root.json
[11:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:28] (CR) Jcrespo: [C: +2] dbbackups: Reenable notifications on db1102 and db2101 after setup [puppet] - https://gerrit.wikimedia.org/r/685397 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo)
[11:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repool db1173', diff saved to https://phabricator.wikimedia.org/P15765 and previous config saved to /var/cache/conftool/dbconfig/20210505-114345-root.json
[11:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:27] (CR) Alexandros Kosiaris: [C: +1] prometheus: Clean up absent file resource [puppet] - https://gerrit.wikimedia.org/r/684801 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm)
[11:58:07] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 116755280 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:58:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173
(re)pooling @ 75%: Repool db1173', diff saved to https://phabricator.wikimedia.org/P15767 and previous config saved to /var/cache/conftool/dbconfig/20210505-115849-root.json
[11:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:30] (PS1) Jbond: O:apereo_cas: add toggle to enfore MFA logins [puppet] - https://gerrit.wikimedia.org/r/685418 (https://phabricator.wikimedia.org/T280691)
[12:00:19] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29400/console" [puppet] - https://gerrit.wikimedia.org/r/685418 (https://phabricator.wikimedia.org/T280691) (owner: Jbond)
[12:00:38] (CR) Jbond: [V: +1 C: +2] O:apereo_cas: add toggle to enfore MFA logins [puppet] - https://gerrit.wikimedia.org/r/685418 (https://phabricator.wikimedia.org/T280691) (owner: Jbond)
[12:00:43] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 17480 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:01:17] (PS1) Kormat: dbtools: Simplify master-pos [software] - https://gerrit.wikimedia.org/r/685423
[12:01:45] !log installing exim security updates on stretch
[12:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:15] (CR) Kormat: [V: +2 C: +2] dbtools: Simplify master-pos [software] - https://gerrit.wikimedia.org/r/685423 (owner: Kormat)
[12:06:00] (PS1) Jbond: (DO NOT MERGE) enforce u2f logins for turnilo [puppet] - https://gerrit.wikimedia.org/r/685425 (https://phabricator.wikimedia.org/T280691)
[12:09:11] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: cp5016.eqsin.wmnet, snapshot1015.eqiad.wmnet, cp5015.eqsin.wmnet, snapshot1014.eqiad.wmnet, maps1009.eqiad.wmnet, webperf1001.eqiad.wmnet,
cp5014.eqsin.wmnet, cp5013.eqsin.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[12:09:45] (CR) Kormat: "I've deployed the grants, so feel free to merge when ready." [puppet] - https://gerrit.wikimedia.org/r/682124 (owner: Aklapper)
[12:12:59] resolving the VO page is me, has been resolved a while ago
[12:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repool db1173', diff saved to https://phabricator.wikimedia.org/P15768 and previous config saved to /var/cache/conftool/dbconfig/20210505-121353-root.json
[12:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:05] (CR) Filippo Giunchedi: "Is cloudgw.eqiad1.wikimediacloud.org supposed to be pingable ? I'm asking because I think the previous approach made more sense (to my clo" [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[12:18:28] ops-eqiad, DC-Ops, Discovery-Search (Current work): hw troubleshooting: IPMI sensor critical for elastic1042.eqiad.wmnet - https://phabricator.wikimedia.org/T278185 (Gehel)
[12:23:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180 for schema change', diff saved to https://phabricator.wikimedia.org/P15769 and previous config saved to /var/cache/conftool/dbconfig/20210505-122351-marostegui.json
[12:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:09] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[12:31:25] (PS1)
WMDE-Fisch: Revert "Enable ReferencePreviews on first wikis CommonSettings" [mediawiki-config] - https://gerrit.wikimedia.org/r/685062
[12:44:50] I know train is starting in a few minutes but would it be okay to deploy the above config revert? ^^'
[12:45:13] ( me or a colleague doing that )
[12:45:41] SRE, serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (hashar)
[12:46:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repool db1180', diff saved to https://phabricator.wikimedia.org/P15770 and previous config saved to /var/cache/conftool/dbconfig/20210505-124651-root.json
[12:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:49:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:50:21] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[12:50:37] FYI: Not doing it now but rather after the train window.
[12:51:14] (CR) Jbond: "See comments inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/685030 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[12:51:32] nothing is happening this train window - train conductor is not in the EU this week; unless someone objects, I'd say it's OK to backport
[12:51:45] CFisch_WMDE, ^
[12:52:13] speaking as backup conductor this week
[12:53:29] (CR) Muehlenhoff: [C: +1] "Looks good, let me know when you merge this and I can run some tests in parallel." [puppet] - https://gerrit.wikimedia.org/r/684831 (owner: Jbond)
[12:58:19] (PS4) Jbond: O:debmonitor::server: request cert using cfssl [puppet] - https://gerrit.wikimedia.org/r/684831
[13:00:04] brennen and liw: #bothumor I � Unicode. All rise for MediaWiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1300).
[13:01:10] (CR) Ottomata: [C: +1] eventgate-logging-external: add codfw kafka-logging hosts [deployment-charts] - https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: Herron)
[13:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repool db1180', diff saved to https://phabricator.wikimedia.org/P15771 and previous config saved to /var/cache/conftool/dbconfig/20210505-130155-root.json
[13:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:56] (CR) Ottomata: [C: +1] "How does cp get its metadata.broker.list configured. Is that already done?" [deployment-charts] - https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: Herron)
[13:07:22] (CR) Jbond: [C: +2] O:debmonitor::server: request cert using cfssl [puppet] - https://gerrit.wikimedia.org/r/684831 (owner: Jbond)
[13:07:51] (CR) Alexandros Kosiaris: [C: +1] "I think that should do it."
[puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [13:08:10] it is jouncebot now [13:08:14] jouncebot: now [13:08:14] For the next 1 hour(s) and 51 minute(s): MediaWiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1300) [13:08:17] jouncebot: next [13:08:18] In 4 hour(s) and 51 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1800) [13:08:18] In 4 hour(s) and 51 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1800) [13:08:35] (03PS1) 10Kormat: install_server: Switch db2129 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/685440 (https://phabricator.wikimedia.org/T280751) [13:10:26] (03CR) 10Kormat: [C: 03+2] install_server: Switch db2129 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/685440 (https://phabricator.wikimedia.org/T280751) (owner: 10Kormat) [13:12:30] !log reimaging db2129 to buster T280751 [13:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:38] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [13:16:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repool db1180', diff saved to https://phabricator.wikimedia.org/P15772 and previous config saved to /var/cache/conftool/dbconfig/20210505-131658-root.json [13:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:45] liw: Thanks, will come back to this in a couple of minutes [13:18:27] PROBLEM - MariaDB Replica IO: s6 on db2124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection 
refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:46] ^ sigh, me. [13:19:01] PROBLEM - MariaDB Replica IO: s6 on db2076 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:01] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:11] PROBLEM - MariaDB Replica IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:19] PROBLEM - MariaDB Replica IO: s6 on db2117 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:20] \o/ [13:19:49] PROBLEM - MariaDB Replica IO: s6 on db2089 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on 
db2129.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751 [13:19:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751 [13:19:53] PROBLEM - MariaDB Replica IO: s6 on db2087 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:57] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [13:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:46] (03PS2) 10Majavah: P::mariadb::beta: Set read only by default [puppet] - 10https://gerrit.wikimedia.org/r/684034 (https://phabricator.wikimedia.org/T110115) [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2076 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 436.36 seconds Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2087 is CRITICAL: 
CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 472.34 seconds Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2089 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.03 seconds Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:50] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:51] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 353.17 seconds Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:52] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 
on db2114 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:52] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2114 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 463.18 seconds Kormat reimaging s6 master https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:21:03] (03PS2) 10Majavah: P::mariadb::beta: Use default socket file location [puppet] - 10https://gerrit.wikimedia.org/r/684088 [13:21:32] sorry folks [13:21:39] my inner elukey has struck [13:22:45] kormat: <3 [13:22:52] I know you always think about me [13:23:04] 💜 [13:25:10] PROBLEM - Keyholder SSH agent on cumin2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [13:29:31] (03CR) 10Volans: "I was approaching the same thing as I would need the bullseye image. 
Much appreciated the refactor effort, couple of questions/comments in" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/683979 (owner: 10Legoktm) [13:31:47] (03PS4) 10Herron: icinga: remove icinga[12]001 site.pp entries [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) [13:32:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repool db1180', diff saved to https://phabricator.wikimedia.org/P15773 and previous config saved to /var/cache/conftool/dbconfig/20210505-133202-root.json [13:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165 for schema change', diff saved to https://phabricator.wikimedia.org/P15774 and previous config saved to /var/cache/conftool/dbconfig/20210505-133259-marostegui.json [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:39] 10SRE, 10serviceops, 10cloud-services-team (Kanban): Replace bootstrap-vz to generate Docker base images - https://phabricator.wikimedia.org/T281984 (10MoritzMuehlenhoff) [13:37:09] I'll deploy a config change now if nobody offends. :-) [13:37:27] (03CR) 10Herron: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [13:37:40] (03PS2) 10WMDE-Fisch: Revert "Enable ReferencePreviews on first wikis CommonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 [13:40:11] 10SRE, 10serviceops: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (10Volans) If possible it would be great if we could get a base bullseye image somehow, even if not auto-updated right from the start and otherwise created. We have already a couple of hosts on bullsey... 
[13:41:26] !log kevinbazira@deploy1002 Started deploy [ores/deploy@5612f30]: Regular ORES Deployment T278723 [13:41:27] (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 (owner: 10WMDE-Fisch) [13:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:34] T278723: ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 [13:42:08] (03CR) 10Majavah: "cherrypicked on beta to get it deployed safely, feel free to merge at any time" [puppet] - 10https://gerrit.wikimedia.org/r/684088 (owner: 10Majavah) [13:42:10] (03Merged) 10jenkins-bot: Revert "Enable ReferencePreviews on first wikis CommonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 (owner: 10WMDE-Fisch) [13:42:12] (03CR) 10Majavah: "cherrypicked on beta to get it deployed safely, feel free to merge at any time" [puppet] - 10https://gerrit.wikimedia.org/r/684034 (https://phabricator.wikimedia.org/T110115) (owner: 10Majavah) [13:47:36] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/29401/" [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [13:48:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [13:48:57] !log wmde-fisch@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:685062|Revert "Enable ReferencePreviews on first wikis CommonSettings" ()]] (duration: 02m 08s) [13:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:22] (03PS1) 10Marostegui: install_server: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/685447 [13:52:52] (03CR) 10Kormat: [C: 03+2] install_server: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/685447 (owner: 10Marostegui) [13:54:27] (03PS1) 10Jbond: demonitor: add debmonitor.wikimedia.org to sni [puppet] 
- 10https://gerrit.wikimedia.org/r/685450 [13:55:09] (03CR) 10Volans: demonitor: add debmonitor.wikimedia.org to sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685450 (owner: 10Jbond) [13:56:05] (03PS3) 10Herron: add kafka-main[12]00[45] to existing kafka-main egress rules and broker lists [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) [13:56:43] (03CR) 10Jbond: [C: 03+2] demonitor: add debmonitor.wikimedia.org to sni [puppet] - 10https://gerrit.wikimedia.org/r/685450 (owner: 10Jbond) [13:56:44] dancy: I think we need to roll back until RC is unbroken [13:56:58] At least for group1 wikis [13:57:30] (03PS1) 10WMDE-Fisch: [beta] Disable ReferencePreviews beta mode on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685451 (https://phabricator.wikimedia.org/T271206) [13:58:13] !log kevinbazira@deploy1002 Finished deploy [ores/deploy@5612f30]: Regular ORES Deployment T278723 (duration: 16m 47s) [13:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:21] T278723: ORES deployment - Spring 2021 - https://phabricator.wikimedia.org/T278723 [13:59:09] (03CR) 10Herron: "> Patch Set 2: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [13:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15775 and previous config saved to /var/cache/conftool/dbconfig/20210505-135920-root.json [13:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:16] Hmm, apparently it didn't roll forward yet. And while d.ancy is assigned, it's actually someone else on the calendar today [14:02:29] So this broke last week? 
[14:03:25] (03CR) 10Ottomata: [C: 03+1] add kafka-main[12]00[45] to existing kafka-main egress rules and broker lists [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [14:03:43] That's the third week in a row we didn't catch a UBN until a week later. Maybe unlucky. Or maybe we need to look more closely at logstash after group1/2 [14:05:44] (03CR) 10Urbanecm: "CR-1. I'd suggest to set IS variable to false, as now we have an unused variable which doesn't do anything the name would suggest." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 (owner: 10WMDE-Fisch) [14:06:54] (03PS1) 10Volans: Use system python to define variable [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/685459 [14:07:15] Krinkle: well if the RC issue was indeed caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/680822, then it is only in wmf.4 :) [14:09:07] (03PS1) 10Volans: Add python-build-bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/685462 [14:10:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/685459 (owner: 10Volans) [14:10:41] (03CR) 10WMDE-Fisch: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 (owner: 10WMDE-Fisch) [14:10:41] Victory! 
ORES deploy looks good thanks to elukey and klausman [14:11:58] (03CR) 10Volans: [V: 03+2 C: 03+2] Use system python to define variable [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/685459 (owner: 10Volans) [14:12:23] (03CR) 10Urbanecm: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 (owner: 10WMDE-Fisch) [14:14:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15776 and previous config saved to /var/cache/conftool/dbconfig/20210505-141423-root.json [14:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:00] * CFisch_WMDE going to merge a beta labs only config patch [14:15:54] (03CR) 10WMDE-Fisch: [C: 03+2] [beta] Disable ReferencePreviews beta mode on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685451 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [14:16:54] (03Merged) 10jenkins-bot: [beta] Disable ReferencePreviews beta mode on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685451 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [14:17:28] !log kormat@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2129.codfw.wmnet with reason: REIMAGE [14:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 to enable report_host', diff saved to https://phabricator.wikimedia.org/P15777 and previous config saved to /var/cache/conftool/dbconfig/20210505-141735-marostegui.json [14:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:17] !log Upgrade kernel and enable report_host on db1126 [14:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:35] !log kormat@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db2129.codfw.wmnet with reason: REIMAGE [14:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:21] PROBLEM - MariaDB Replica Lag: s6 on db2117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4048.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:20:21] PROBLEM - MariaDB Replica Lag: s6 on db2124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4049.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:21:52] CFisch_WMDE: /me complains about the confusing term 'beta labs', use 'beta cluster' instead [14:22:19] Majavah: Sure, next time ;-) [14:22:19] 10SRE, 10serviceops: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (10Joe) >>! In T281596#7061654, @Volans wrote: > If possible it would be great if we could get a base bullseye image somehow, even if not auto-updated right from the start and otherwise created. > We h... 
[14:24:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15778 and previous config saved to /var/cache/conftool/dbconfig/20210505-142431-root.json [14:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] PROBLEM - MariaDB Replica Lag: s6 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4551.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:28:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751 [14:28:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751 [14:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:56] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [14:29:02] sigh, downtimes expired because it took 45mins to figure out why the reimage wasn't working [14:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15779 and previous config saved to /var/cache/conftool/dbconfig/20210505-142927-root.json [14:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:57] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat db2129 isnt back yet. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:57] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4595.98 seconds Kormat db2129 isnt back yet. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:57] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat db2129 isnt back yet. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:29:57] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4551.86 seconds Kormat db2129 isnt back yet. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:31:02] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2117 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2129.codfw.wmnet (111 Connection refused) Kormat reimage https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:31:02] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4673.84 seconds Kormat reimage https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:31:14] kormat, this has happened to me many times :-) my suggestion is that if you think something is going to take 1h, downtime it for 1 day :-), there is always something in the way (distractions, outages, things going badly) [14:31:49] worst case scenario, you can manually delete the downtimes if needed [14:33:59] I guess, technically 
"best case scenario" (maintenance is fast) [14:34:14] (03CR) 10DCausse: "PS17 changes:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [14:35:05] (03Abandoned) 10Gergő Tisza: GrowthExperiments: enable link recommendations backend on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675567 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [14:35:11] (03Abandoned) 10Gergő Tisza: GrowthExperiments: enable link recommendations backend on group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675568 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [14:35:16] (03Abandoned) 10Gergő Tisza: GrowthExperiments: enable link recommendations backend on group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675569 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [14:36:21] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:36:31] RECOVERY - MariaDB Replica IO: s6 on db2089 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:36:37] RECOVERY - MariaDB Replica IO: s6 on db2076 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:36:43] RECOVERY - MariaDB Replica IO: s6 on db2124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:36:43] RECOVERY - MariaDB Replica IO: s6 on db2117 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:36:47] RECOVERY - MariaDB Replica IO: s6 on db2087 is OK: OK slave_io_state Slave_IO_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:37:43] RECOVERY - MariaDB Replica IO: s6 on db2097 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:38:48] (03PS1) 10MSantos: build: add build info flags [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/685486 [14:39:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15780 and previous config saved to /var/cache/conftool/dbconfig/20210505-143934-root.json [14:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:21] (03CR) 10DCausse: rdf-streaming-updater: enable HA capability (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [14:43:22] 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10RobH) [14:43:33] 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: (Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10RobH) [14:44:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repool db1165', diff saved to https://phabricator.wikimedia.org/P15781 and previous config saved to /var/cache/conftool/dbconfig/20210505-144431-root.json [14:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:49] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-dns-floating-ip-updater.py: fix typo in config option [puppet] - 10https://gerrit.wikimedia.org/r/685488 [14:46:55] RECOVERY - MariaDB Replica Lag: s6 on db2141 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[14:47:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:49:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:50:13] RECOVERY - MariaDB Replica Lag: s6 on db2117 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:50:15] RECOVERY - MariaDB Replica Lag: s6 on db2124 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:50:58] contint2001 backups are running now, hopefully we will get rid of the alerts soon [14:51:09] (03PS3) 10Herron: remove all references to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) [14:51:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-dns-floating-ip-updater.py: run black [puppet] - 10https://gerrit.wikimedia.org/r/685491 [14:52:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P::toolforge::mailrelay: support multiple domains [puppet] - 10https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109) (owner: 10Majavah) [14:53:14] (03CR) 10Herron: [C: 03+2] remove all references to icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/682992 (https://phabricator.wikimedia.org/T279601) (owner: 10Herron) [14:54:07] (03CR) 10Volans: [C: 03+1] "LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685464 (owner: 10Jbond) [14:54:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 20%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15782 and 
previous config saved to /var/cache/conftool/dbconfig/20210505-145438-root.json [14:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:50] (03CR) 10Kormat: [C: 03+1] dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681621 (https://phabricator.wikimedia.org/T280751) (owner: 10Jcrespo) [14:58:38] (03PS3) 10Jcrespo: dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681621 (https://phabricator.wikimedia.org/T280751) [14:58:40] (03PS3) 10Herron: remove all references to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) [14:59:58] (03CR) 10Herron: [C: 03+2] remove all references to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/682999 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [15:00:25] (03PS5) 10Herron: icinga: remove icinga[12]001 site.pp entries [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) [15:00:29] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10AntiCompositeNumber) To clarify: Based on T280718, fc-list will directly relate to what font... 
[15:01:55] (03CR) 10Herron: [C: 03+2] icinga: remove icinga[12]001 site.pp entries [puppet] - 10https://gerrit.wikimedia.org/r/685087 (https://phabricator.wikimedia.org/T279602) (owner: 10Herron) [15:02:47] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switchover s6 codfw database backups from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/681621 (https://phabricator.wikimedia.org/T280751) (owner: 10Jcrespo) [15:05:22] (03PS3) 10CDanis: Add a public_cloud bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/679341 (https://phabricator.wikimedia.org/T279380) [15:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 30%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15783 and previous config saved to /var/cache/conftool/dbconfig/20210505-150942-root.json [15:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751 [15:10:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751 [15:10:05] !log decommissioning icinga[12]001 hosts T279601 T279602 [15:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:08] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [15:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:25] T279602: reclaim icinga2001.wikimedia.org - https://phabricator.wikimedia.org/T279602 [15:10:25] T279601: reclaim icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T279601 [15:11:01] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts icinga1001.wikimedia.org [15:11:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:18:15] (PS1) Jcrespo: dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140 [puppet] - https://gerrit.wikimedia.org/r/685494 (https://phabricator.wikimedia.org/T280751) [15:18:28] (PS2) Jcrespo: dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140 [puppet] - https://gerrit.wikimedia.org/r/685494 (https://phabricator.wikimedia.org/T280751) [15:19:00] (PS3) Jcrespo: dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140 [puppet] - https://gerrit.wikimedia.org/r/685494 (https://phabricator.wikimedia.org/T280751) [15:21:48] (PS1) Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [15:21:50] (PS1) Jbond: hiera - cp1077: test CA bundle with pki and puppet ca certs [puppet] - https://gerrit.wikimedia.org/r/685496 [15:21:52] (PS1) Jbond: P:traffic::backend: update the source of the ATS trusted ca bundle [puppet] - https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [15:21:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:21:57] (CR) RLazarus: [C: +1] httpbb: add tests and test_suite for people.wm.org [puppet] - https://gerrit.wikimedia.org/r/685149 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn) [15:22:06] (CR) Filippo Giunchedi: "LGTM overall, though these hosts should be removed from kafka_brokers_logging too in hieradata/common.yaml AFAICS?" [puppet] - https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: Herron) [15:22:29] (PS2) Jbond: P:traffic::backend: update the source of the ATS trusted ca bundle [puppet] - https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [15:23:00] (PS2) Jbond: hiera - cp1077: test CA bundle with pki and puppet ca certs [puppet] - https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [15:23:17] (CR) Ladsgroup: "> Patch Set 1:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933) (owner: Legoktm) [15:23:20] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts icinga1001.wikimedia.org [15:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:45] (CR) Ssingh: "Thanks for the review!
Comments inline:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/685030 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [15:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15784 and previous config saved to /var/cache/conftool/dbconfig/20210505-152445-root.json [15:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:10] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts icinga2001.wikimedia.org [15:25:15] (CR) Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: Jbond) [15:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:55] (CR) Jbond: hiera - cp1077: test CA bundle with pki and puppet ca certs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) (owner: Jbond) [15:26:21] (CR) jerkins-bot: [V: -1] dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140 [puppet] - https://gerrit.wikimedia.org/r/685494 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo) [15:27:19] (PS2) Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [15:28:07] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29406/console" [puppet] - https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: Jbond) [15:28:33] SRE, Wikimedia-Mailing-lists: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (Effeietsanders) The email was in English.
Happy to forward if that is of help. [15:29:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=icinga site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:31:14] (PS3) Jbond: hiera - cp1077: test CA bundle with pki and puppet ca certs [puppet] - https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [15:31:26] (CR) Marostegui: [C: +1] P::mariadb::beta: Set read only by default [puppet] - https://gerrit.wikimedia.org/r/684034 (https://phabricator.wikimedia.org/T110115) (owner: Majavah) [15:31:30] (PS3) Jbond: P:traffic::backend: update the source of the ATS trusted ca bundle [puppet] - https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [15:32:12] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29408/console" [puppet] - https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) (owner: Jbond) [15:32:26] (CR) Giuseppe Lavagetto: [C: +2] P::mariadb::beta: Set read only by default [puppet] - https://gerrit.wikimedia.org/r/684034 (https://phabricator.wikimedia.org/T110115) (owner: Majavah) [15:32:50] (PS3) Giuseppe Lavagetto: P::mariadb::beta: Use default socket file location [puppet] - https://gerrit.wikimedia.org/r/684088 (owner: Majavah) [15:33:33] (PS1) Vgutierrez: trafficserver: Fix cacert_(dirpath|filename) usage [puppet] - https://gerrit.wikimedia.org/r/685503 (https://phabricator.wikimedia.org/T281673) [15:33:35] (PS1) Vgutierrez: trafficserver: Clear outbound TLS cacert_path for cp4026 and cp4032 [puppet] - https://gerrit.wikimedia.org/r/685504 (https://phabricator.wikimedia.org/T281673) [15:33:44] (CR) Giuseppe Lavagetto: [C: +2] P::mariadb::beta: Use default socket file location [puppet] -
https://gerrit.wikimedia.org/r/684088 (owner: Majavah) [15:34:00] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29410/console" [puppet] - https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) (owner: Jbond) [15:34:02] (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/685494 (https://phabricator.wikimedia.org/T280751) (owner: Jcrespo) [15:37:37] (PS2) Majavah: mediawiki: Remove 'deployment.wikimedia' vhost from Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/684117 (https://phabricator.wikimedia.org/T198673) (owner: Krinkle) [15:38:19] (PS3) Majavah: [Beta] traffic: Set upload_domain to upload.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/684120 (https://phabricator.wikimedia.org/T281650) (owner: Krinkle) [15:39:29] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Remove 'deployment.wikimedia' vhost from Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/684117 (https://phabricator.wikimedia.org/T198673) (owner: Krinkle) [15:39:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15785 and previous config saved to /var/cache/conftool/dbconfig/20210505-153949-root.json [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:36] (CR) Giuseppe Lavagetto: [C: +2] [Beta] traffic: Set upload_domain to upload.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/684120 (https://phabricator.wikimedia.org/T281650) (owner: Krinkle) [15:43:12] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts icinga2001.wikimedia.org [15:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:57] (PS1) WMDE-Fisch: Enable reference previews for anonymous users when not in beta
[extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685475 [15:45:06] (PS1) Jbond: tlsproxy: make discovery the default cfssl_label in production [puppet] - https://gerrit.wikimedia.org/r/685511 [15:45:19] (PS1) WMDE-Fisch: Enable reference previews for anonymous users when not in beta [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685476 [15:45:52] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29412/console" [puppet] - https://gerrit.wikimedia.org/r/685511 (owner: Jbond) [15:46:40] (PS1) Muehlenhoff: Add control file for wmf-mariadb-client104 for Bullseye [software] - https://gerrit.wikimedia.org/r/685512 [15:47:13] (PS1) Andrew-WMDE: Enable ReferencePreviews for old users by default [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) [15:47:27] (PS2) Vgutierrez: trafficserver: Fix cacert_(dirpath|filename) usage [puppet] - https://gerrit.wikimedia.org/r/685503 (https://phabricator.wikimedia.org/T281673) [15:47:29] (PS2) Vgutierrez: trafficserver: Clear outbound TLS cacert_path for cp4026 and cp4032 [puppet] - https://gerrit.wikimedia.org/r/685504 (https://phabricator.wikimedia.org/T281673) [15:47:32] (PS1) Andrew-WMDE: Enable ReferencePreviews for old users by default [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) [15:48:45] SRE, Wikimedia-SVG-rendering, serviceops-radar, Patch-For-Review: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (4nn1l2) @JoKalliauer That is correct.
Thanks [15:48:58] SRE, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (JoKalliauer) The issue still exist in `librsvg 2.51.1` Rendering of https://commons.wikimedia.org/wiki/File:Fonttest-Kerning.svg |[... [15:49:43] SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (Majavah) [15:49:58] SRE, Beta-Cluster-Infrastructure, Patch-For-Review, User-Majavah: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (Majavah) Open→Resolved a:Majavah I'm calling this resolved. The patches are deployed and I amended the documentat... [15:50:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:51:00] SRE, Wikimedia-Mailing-lists: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (Legoktm) >>! In T281987#7062171, @Effeietsanders wrote: > The email was in English. Happy to forward if that is of help. Uhh, yes please. legoktm@wikimedia.org [15:52:58] SRE, ops-eqiad, cloud-services-team (Hardware): labstore1007 crashed after storage controller errors--replace disk? - https://phabricator.wikimedia.org/T281045 (Andrew) Just to clarify: we (the wmcs team) are in agreement that we should spend the money and buy a new drive. No objections if y'all wan...
[15:54:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P15786 and previous config saved to /var/cache/conftool/dbconfig/20210505-155453-root.json [15:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:22] (PS1) Ssingh: Add python3-yaml to package builder packages [puppet] - https://gerrit.wikimedia.org/r/685515 [15:59:16] (PS2) Ssingh: package_builder: add python3-yaml [puppet] - https://gerrit.wikimedia.org/r/685515 [15:59:18] (CR) Marostegui: [C: +1] "Haven't tried this, but hopefully will have time "soon"" [software] - https://gerrit.wikimedia.org/r/685512 (owner: Muehlenhoff) [16:03:23] (CR) Muehlenhoff: "Is this needed during the source generation step? Looks good, then." [puppet] - https://gerrit.wikimedia.org/r/685515 (owner: Ssingh) [16:04:55] (PS1) CDanis: Add IRC alerting for two relevant NEL subtypes [puppet] - https://gerrit.wikimedia.org/r/685516 (https://phabricator.wikimedia.org/T257527) [16:06:42] (CR) jerkins-bot: [V: -1] Add IRC alerting for two relevant NEL subtypes [puppet] - https://gerrit.wikimedia.org/r/685516 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [16:08:15] (PS2) CDanis: Add IRC alerting for two relevant NEL subtypes [puppet] - https://gerrit.wikimedia.org/r/685516 (https://phabricator.wikimedia.org/T257527) [16:08:40] (CR) Ssingh: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/685515 (owner: Ssingh) [16:10:05] SRE, Wikimedia-Mailing-lists, Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (Legoktm) Thanks, this is https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/commands/cli_notify.py#L134 which is not customizable via the template system (there'...
[16:12:24] (CR) Ahmon Dancy: [C: +1] logspam-watch: correctly handle 0 for total error counts [puppet] - https://gerrit.wikimedia.org/r/685231 (https://phabricator.wikimedia.org/T281121) (owner: Brennen Bearnes) [16:13:25] (CR) Ssingh: [C: +2] package_builder: add python3-yaml [puppet] - https://gerrit.wikimedia.org/r/685515 (owner: Ssingh) [16:14:13] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Ladsgroup) I closed all the ones mentioned above + pressmeldungen [16:15:04] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Ladsgroup) [16:15:16] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/685516 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [16:16:17] (CR) WMDE-Fisch: [C: +1] Enable ReferencePreviews for old users by default [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [16:16:26] (CR) WMDE-Fisch: [C: +1] Enable ReferencePreviews for old users by default [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [16:18:51] (PS1) Legoktm: Initial commit [software/mailman-templates] - https://gerrit.wikimedia.org/r/685519 (https://phabricator.wikimedia.org/T282018) [16:20:20] (CR) Legoktm: [V: +2 C: +2] Initial commit [software/mailman-templates] - https://gerrit.wikimedia.org/r/685519 (https://phabricator.wikimedia.org/T282018) (owner: Legoktm) [16:22:32] (CR) Arturo Borrero Gonzalez: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [16:22:58] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29413/console" [puppet] - https://gerrit.wikimedia.org/r/685503 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez) [16:29:35] (PS5) Arturo Borrero Gonzalez: cloudgw: introduce icinga checks [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) [16:31:12] (PS1) Jsn.sherman: labs: Enable TheWikipediaLibrary [mediawiki-config] - https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T256297) [16:31:14] (CR) BBlack: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/679341 (https://phabricator.wikimedia.org/T279380) (owner: CDanis) [16:31:25] (PS2) Awight: Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [16:31:49] (Abandoned) Awight: Enable reference previews for anonymous users when not in beta [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685475 (owner: WMDE-Fisch) [16:32:57] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/685379 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez) [16:35:19] (PS2) Awight: Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [16:35:29] (Abandoned) Awight: Enable reference previews for anonymous users when not in beta [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685476 (owner: WMDE-Fisch) [16:44:43] (CR) Awight: [C: +1] Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [16:44:50] (CR) Awight: [C:
+1] Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [16:48:00] SRE, Wikimedia-Mailing-lists, translatewiki.net: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (Legoktm) [16:49:30] (PS1) Jcrespo: Initial control files for 10.5 mariadb packages [software] - https://gerrit.wikimedia.org/r/685524 [16:49:39] (PS1) Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685525 [16:49:41] (PS1) Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685526 [16:49:43] (PS1) Ahmon Dancy: DevServices.php: Fix irc entry and add add linkrecommendation [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685527 [16:50:35] (PS2) Jcrespo: Initial control files for 10.5 mariadb packages [software] - https://gerrit.wikimedia.org/r/685524 [16:50:55] (CR) jerkins-bot: [V: -1] DevServices.php: Fix irc entry and add add linkrecommendation [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685527 (owner: Ahmon Dancy) [16:51:02] (PS2) Ahmon Dancy: DevServices.php: Fix irc entry and add add linkrecommendation [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685527 [16:52:30] (PS3) Ahmon Dancy: DevServices.php: Fix irc entry and add add linkrecommendation [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685527 [16:53:16] (CR) Ladsgroup: [C: +1] "Tested on polymorphic. Works like a charm."
[puppet] - https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933) (owner: Legoktm) [16:59:08] (CR) Ahmon Dancy: [C: +2] DevServices.php: Fix irc entry and add add linkrecommendation [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685527 (owner: Ahmon Dancy) [17:02:29] (Merged) jenkins-bot: DevServices.php: Fix irc entry and add add linkrecommendation [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685527 (owner: Ahmon Dancy) [17:19:17] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Dzahn) So there was no archive for pressemeldungen then? [17:20:34] (PS1) Legoktm: Add qqq [software/mailman-templates] - https://gerrit.wikimedia.org/r/685533 [17:20:54] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:28] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Ladsgroup) that mailing list doesn't have an archive kept. [17:21:30] ryankemper: FYI ^^^ [17:21:49] (CR) Jbond: "See inline, i should also be around irc a bit longer tonight if you wanted to chat" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/685030 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [17:22:50] (CR) Jbond: [V: +1 C: +2] O:icinga: add monitogin for debmonitor.wikimedia.org vip [puppet] - https://gerrit.wikimedia.org/r/685464 (owner: Jbond) [17:23:45] SRE, Wikimedia-SVG-rendering, serviceops-radar, Patch-For-Review: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (Dzahn) >>! In T280718#7060805, @4nn1l2 wrote: > @Dzahn the list is not complete yet. The co...
[17:23:50] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [17:24:33] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Dzahn) ACK, i'll let them know. Then it can be closed as well. Since the only response so far was that they didn't know it but asked for an archive. [17:28:48] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [17:28:54] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) (owner: Andrew-WMDE) [17:29:07] (PS1) Razzi: kerberos: enable for sihe [puppet] - https://gerrit.wikimedia.org/r/685535 (https://phabricator.wikimedia.org/T281809) [17:30:28] SRE, Wikimedia-Mailing-lists, translatewiki.net: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (Legoktm) I proposed some qqq in https://gerrit.wikimedia.org/r/c/operations/software/mailman-templates/+/685533/ - let me know if that's sufficient or needs improve...
[17:31:01] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) Thank you, sounds good @jcrespo [17:31:44] (PS2) Gergő Tisza: flaggedrevs.php: Use MediaWikiServices, not an extension function [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 [17:33:43] (PS1) Brennen Bearnes: Fix order of joins in SpecialRecentChanges [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685480 (https://phabricator.wikimedia.org/T281981) [17:36:19] (PS1) Volans: Add python_deploy_venv class [puppet] - https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) [17:36:38] (CR) Dzahn: [C: +2] httpbb: add tests and test_suite for people.wm.org [puppet] - https://gerrit.wikimedia.org/r/685149 (https://phabricator.wikimedia.org/T280989) (owner: Dzahn) [17:37:48] (CR) jerkins-bot: [V: -1] Add python_deploy_venv class [puppet] - https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: Volans) [17:38:25] (CR) Dzahn: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/685056 (owner: Jcrespo) [17:38:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:38:53] (CR) Ppchelko: [C: +1] Fix order of joins in SpecialRecentChanges [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685480 (https://phabricator.wikimedia.org/T281981) (owner: Brennen Bearnes) [17:39:30] (CR) Dzahn: [C: +1] "This _should_ not do anything until it gets the role and is pooled." [puppet] - https://gerrit.wikimedia.org/r/685132 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [17:40:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:43:10] (PS1) Dzahn: httpbb: fix path to tests for people.wm.org [puppet] - https://gerrit.wikimedia.org/r/685539 [17:43:18] Pchelolo: anything testable with that patch / is there a specific order those files need synced in? [17:43:39] (CR) Dzahn: [C: +2] httpbb: fix path to tests for people.wm.org [puppet] - https://gerrit.wikimedia.org/r/685539 (owner: Dzahn) [17:43:41] (CR) Elukey: [C: +1] kerberos: enable for sihe [puppet] - https://gerrit.wikimedia.org/r/685535 (https://phabricator.wikimedia.org/T281809) (owner: Razzi) [17:43:55] (CR) Razzi: [C: +2] kerberos: enable for sihe [puppet] - https://gerrit.wikimedia.org/r/685535 (https://phabricator.wikimedia.org/T281809) (owner: Razzi) [17:44:29] brennen: order - no, I do not think it matters. [17:44:33] testable.. one sec [17:45:05] https://www.mediawiki.org/wiki/Special:RecentChanges?userExpLevel=newcomer&hidebots=1&translations=filter&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2 [17:46:28] Pchelolo: cool, thanks. in a meeting now but i'll either get that out during the upcoming backport window or before rolling the train during the train window.
[17:46:44] (PS2) Volans: Add python_deploy::venv class [puppet] - https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) [17:48:26] (CR) jerkins-bot: [V: -1] Add python_deploy::venv class [puppet] - https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: Volans) [17:48:32] RECOVERY - WDQS high update lag on wdqs2007 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.148e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:51:31] (CR) Dzahn: [C: +2] conftool-data: add phab2002 to codfw git-ssh pool [puppet] - https://gerrit.wikimedia.org/r/685132 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [17:51:54] SRE, Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (BBlack) I checked the BIOS/iDRAC settings on cp5013 against https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation#Initial_System_Setup (+ the one custom setting we use on t... [17:52:05] (PS1) Ladsgroup: prometheus: Export number of wikidata errors every hour [puppet] - https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) [17:52:14] (CR) Dzahn: [C: -2] "no, doesnt exist in DNS yet.
first needs a new service IP in netbox and we should never have used a hostname as part of that name" [puppet] - https://gerrit.wikimedia.org/r/685132 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [17:53:10] (CR) jerkins-bot: [V: -1] prometheus: Export number of wikidata errors every hour [puppet] - https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) (owner: Ladsgroup) [17:54:37] (PS2) Ladsgroup: prometheus: Export number of wikidata errors every hour [puppet] - https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) [17:56:22] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/29415/thumbor1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: Dzahn) [17:58:18] !log push pfw policies - T281942 [17:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:16] !log adding a systemd timer to all thumbor servers that writes output of fc-list command into /srv/fc-list/fc-list (T280718) [17:59:17] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp501[3456].eqsin.wmnet,service=varnish-fe [17:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:23] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 [17:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:38] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp501[3456].eqsin.wmnet,service=ats-tls [17:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:46] !log bblack@cumin1001 conftool action : set/weight=100; selector: name=cp501[3456].eqsin.wmnet,service=ats-be [17:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04]
RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1800). [18:00:04] herron, Andrew-WMDE, and tgr: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:04] brennen and liw: I, the Bot under the Fountain, allow thee, The Deployer, to do Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1800). [18:00:38] o/ [18:00:45] hey [18:00:51] Hi [18:01:16] (CR) Dzahn: "It worked:" [puppet] - https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: Dzahn) [18:02:23] (Abandoned) Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685526 (owner: Ahmon Dancy) [18:02:29] (Abandoned) Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685525 (owner: Ahmon Dancy) [18:02:37] (CR) CDanis: [C: +2] Add IRC alerting for two relevant NEL subtypes [puppet] - https://gerrit.wikimedia.org/r/685516 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [18:02:45] (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/685545 [18:03:06] (PS1) Gergő Tisza: Prevent edit notices from appearing [extensions/GrowthExperiments] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685482 (https://phabricator.wikimedia.org/T281960) [18:03:41] SRE, Wikimedia-SVG-rendering, serviceops-radar, Patch-For-Review: Re-evaluate whether keeping
around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Dzahn) On thumbor servers we now have this: example thumbor1002: ` [thumbor1002:/srv/fc-lis... [18:04:02] (03CR) 10CDanis: [C: 03+2] Add a public_cloud bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/679341 (https://phabricator.wikimedia.org/T279380) (owner: 10CDanis) [18:04:37] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685545 (owner: 10Ahmon Dancy) [18:04:59] I added a last-minute entry to the deploy window [18:05:34] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/685545 (owner: 10Ahmon Dancy) [18:05:40] (03PS1) 10Gergő Tisza: Prevent edit notices from appearing [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685483 (https://phabricator.wikimedia.org/T281960) [18:05:44] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10CDanis) [18:06:25] two, actually [18:06:40] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10CDanis) @fdans @JAllemandou New map entry should be ready for Analytics to set up in Turnilo :) [18:08:04] I can do the deploys [18:08:36] 10SRE, 10Product-Data-Infrastructure, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [18:09:33] tgr_: if you wouldn't mind, ping me when backports done? 
[18:09:41] sure [18:09:45] thx [18:10:35] (03CR) 10Gergő Tisza: [C: 03+2] Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) (owner: 10Andrew-WMDE) [18:10:38] (03CR) 10Gergő Tisza: [C: 03+2] Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) (owner: 10Andrew-WMDE) [18:11:16] (03CR) 10Gergő Tisza: [C: 03+2] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [18:11:27] (03CR) 10Gergő Tisza: [C: 03+2] Prevent edit notices from appearing [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685483 (https://phabricator.wikimedia.org/T281960) (owner: 10Gergő Tisza) [18:11:37] (03CR) 10Gergő Tisza: [C: 03+2] Prevent edit notices from appearing [extensions/GrowthExperiments] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685482 (https://phabricator.wikimedia.org/T281960) (owner: 10Gergő Tisza) [18:11:51] (03PS5) 10Gergő Tisza: replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [18:12:15] (03CR) 10Gergő Tisza: [C: 03+2] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [18:13:42] (03Merged) 10jenkins-bot: replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [18:15:17] herron: it's on mwdebug1001 [18:15:30] tgr_: ok looking [18:19:02] (03Merged) 10jenkins-bot: Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/685477 (https://phabricator.wikimedia.org/T271206) (owner: 10Andrew-WMDE) [18:19:06] (03Merged) 10jenkins-bot: Enable Reference Previews for more users [extensions/Popups] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685478 (https://phabricator.wikimedia.org/T271206) (owner: 10Andrew-WMDE) [18:21:14] Andrew-WMDE: is there a dependency between the two changed files? [18:21:44] tgr_ no [18:21:55] tgr_: yeah seems ok, still seeing logs flowing to mwlog from mwdebug1001 [18:23:37] (03PS3) 10Gergő Tisza: flaggedrevs.php: Use MediaWikiServices, not an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 [18:24:06] !log tgr@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:677002|replace mwlog1001 with new mwlog[12]002 hosts (T224565)]] (duration: 01m 24s) [18:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:15] T224565: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 [18:26:50] (03Merged) 10jenkins-bot: Prevent edit notices from appearing [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685483 (https://phabricator.wikimedia.org/T281960) (owner: 10Gergő Tisza) [18:26:53] (03Merged) 10jenkins-bot: Prevent edit notices from appearing [extensions/GrowthExperiments] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685482 (https://phabricator.wikimedia.org/T281960) (owner: 10Gergő Tisza) [18:28:01] Andrew-WMDE: it's on mwdebug1001 [18:28:36] tgr_: which one? [18:28:41] both [18:28:56] tgr_: ok checking both [18:31:25] tgr_: Looks good, thank you!
[18:32:58] (03CR) 10Gergő Tisza: [C: 03+2] flaggedrevs.php: Use MediaWikiServices, not an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [18:33:08] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/Popups/includes: Backport: [[gerrit:685477|Enable Reference Previews for more users (T271206)]] (duration: 01m 11s) [18:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:17] T271206: Enable RefPreviews on first wikis - https://phabricator.wikimedia.org/T271206 [18:34:14] (03Merged) 10jenkins-bot: flaggedrevs.php: Use MediaWikiServices, not an extension function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [18:34:26] 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10BBlack) The others were in the same state. All are fixed and rebooted now, icinga downtimes are removed, netbox status is set to `Active`, and confctl weights are set correctly, but the `pooled` attribute... 
[18:34:36] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.4/extensions/Popups/includes: Backport: [[gerrit:685478|Enable Reference Previews for more users (T271206)]] (duration: 01m 08s) [18:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:57] (03PS3) 10Volans: Add python_deploy::venv class [puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) [18:37:09] (03PS1) 10WMDE-Fisch: Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) [18:40:33] !log tgr@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:679938|flaggedrevs.php: Use MediaWikiServices, not an extension function]] (duration: 01m 08s) [18:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:00] (03PS1) 10Jbond: P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685556 [18:42:01] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GrowthExperiments/modules/homepage/addlink/AddLinkArticleTarget.js: Backport: [[gerrit:685483|Prevent edit notices from appearing (T281960)]] (duration: 01m 08s) [18:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:09] T281960: TypeError: can't access property "setNotices", actionTools.notices is undefined - https://phabricator.wikimedia.org/T281960 [18:42:14] (03CR) 10Andrew-WMDE: [C: 03+1] Enable ReferencePreviews on first wikis CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: 10WMDE-Fisch) [18:42:23] (03CR) 10WMDE-Fisch: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685062 (owner: 10WMDE-Fisch) [18:42:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29416/console" [puppet] 
- 10https://gerrit.wikimedia.org/r/685556 (owner: 10Jbond) [18:43:20] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.4/extensions/GrowthExperiments/modules/homepage/addlink/AddLinkArticleTarget.js: Backport: [[gerrit:685482|Prevent edit notices from appearing (T281960)]] (duration: 01m 08s) [18:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:41] !log Morning deploys done [18:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:53] (03CR) 10jerkins-bot: [V: 04-1] P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685556 (owner: 10Jbond) [18:43:57] ^ brennen [18:44:00] tgr_: thanks [18:45:36] (03CR) 10Brennen Bearnes: [C: 03+2] Fix order of joins in SpecialRecentChanges [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685480 (https://phabricator.wikimedia.org/T281981) (owner: 10Brennen Bearnes) [18:47:08] Andrew-WMDE: I get "ext.popups should not even be loaded!" on the browser console when I visit a wiki [18:48:26] tgr_ thank you! [18:48:57] _tgr: Taking a look... [18:49:16] (03PS2) 10Jbond: P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685556 [18:50:19] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp501[35].eqsin.wmnet [18:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29417/console" [puppet] - 10https://gerrit.wikimedia.org/r/685556 (owner: 10Jbond) [18:51:07] (03CR) 10Cwhite: "Looks good! Some items to consider inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) (owner: 10Ladsgroup) [18:51:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685556 (owner: 10Jbond) [18:54:47] (03PS1) 10Jbond: Revert "P:pki::client: fix chain an chained certificates" [puppet] - 10https://gerrit.wikimedia.org/r/685484 [18:54:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:pki::client: fix chain an chained certificates" [puppet] - 10https://gerrit.wikimedia.org/r/685484 (owner: 10Jbond) [18:56:07] tgr_: hmm, I can't seem to reproduce the error [18:56:11] (03PS1) 10Jbond: P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685485 [18:58:39] (03PS3) 10Ladsgroup: prometheus: Export number of wikidata errors every hour [puppet] - 10https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) [18:58:43] (03CR) 10Ladsgroup: prometheus: Export number of wikidata errors every hour (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) (owner: 10Ladsgroup) [18:58:49] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.09753 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:59:33] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp501[46].eqsin.wmnet [18:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:47] RECOVERY - check_trafficserver_log_fifo_analytics_tls on cp5016 is OK: OK: read 8 bytes as expected https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:59:55] is the puppet issue known? [19:00:04] brennen and liw: That opportune time is upon us again. Time for a MediaWiki train - American+European Version deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1900). [19:00:21] jynus: that was me sorry i thought i had reverted before triggering the alert [19:00:38] as long as it is known, no harm done :-) [19:01:02] known/WIP/etc [19:01:18] yes it's me :) [19:01:21] !log 1.37.0-wmf.4 train status (T281145): deploying patch for T282038 and then rolling forward to group1. [19:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:33] T282038: PHP Deprecated: Caller from LinkBatch::doQuery (for Skin::preloadExistence) ignored an error originally raised from SpecialRecentChanges::doMainQuery: [1054] Unknown column 'actor_user' in 'on clause' (10.64.0.44) - https://phabricator.wikimedia.org/T282038 [19:01:33] T281145: 1.37.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T281145 [19:01:53] (03PS2) 10Jbond: P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685485 [19:02:20] (03CR) 10Cwhite: [C: 03+2] prometheus: Export number of wikidata errors every hour [puppet] - 10https://gerrit.wikimedia.org/r/685541 (https://phabricator.wikimedia.org/T274420) (owner: 10Ladsgroup) [19:02:26] Andrew-WMDE: seems to be some sort of ResourceLoader cache invalidation problem [19:02:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29418/console" [puppet] - 10https://gerrit.wikimedia.org/r/685485 (owner: 10Jbond) [19:03:24] tgr_: anything i need to be aware of there? [19:03:37] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack) [19:03:49] 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10BBlack) 05Open→03Resolved a:03BBlack These are all pooled now and slowly filling their caches. Optimistically closing this task for now!
[19:04:15] (03PS2) 10Legoktm: mailman3: Copy "info" field over manually [puppet] - 10https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933) [19:05:06] brennen: ResourceLoader gets caching wrong sometimes after backports. As long as it doesn't break anything visibly it shouldn't matter. If you are about to do a full scap, it definitely doesn't matter. [19:05:06] (03PS1) 10Zabe: Add ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685557 (https://phabricator.wikimedia.org/T281925) [19:05:22] (03CR) 10Legoktm: mailman3: Copy "info" field over manually (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933) (owner: 10Legoktm) [19:05:28] (03CR) 10Legoktm: [C: 03+2] mailman3: Copy "info" field over manually [puppet] - 10https://gerrit.wikimedia.org/r/685212 (https://phabricator.wikimedia.org/T281933) (owner: 10Legoktm) [19:06:20] ack, thx. [19:06:36] (03CR) 10Ssingh: "(Commenting on a merged patch as the issue persists.)" [puppet] - 10https://gerrit.wikimedia.org/r/685515 (owner: 10Ssingh) [19:08:11] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman3 import script is unnecessarily truncating list descriptions - https://phabricator.wikimedia.org/T281933 (10Legoktm) 05Open→03Resolved [19:08:15] (03Merged) 10jenkins-bot: Fix order of joins in SpecialRecentChanges [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685480 (https://phabricator.wikimedia.org/T281981) (owner: 10Brennen Bearnes) [19:08:53] tgr_: so I changed a few user preferences and was able to reproduce the problem [19:08:57] It looks like it's some superfluous logging which we will fix in a follow up [19:09:30] thx! 
[19:10:21] !log starting migration of public mailing lists in group b and c to mailman3 (T280322) [19:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:31] (03CR) 10Dzahn: [C: 03+2] "Thank you, Stevie Shirley" [puppet] - 10https://gerrit.wikimedia.org/r/682124 (owner: 10Aklapper) [19:10:32] T280322: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 [19:10:59] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005875 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:11:19] (03PS2) 10Dzahn: phabricator weekly changes email: List dashboard panel changes [puppet] - 10https://gerrit.wikimedia.org/r/682124 (owner: 10Aklapper) [19:13:17] tgr_: Thanks for the deploy and letting me know! [19:14:21] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.4/includes/specials: Backport: [[gerrit:685480|Fix order of joins in SpecialRecentChanges (T281981)]] (duration: 01m 08s) [19:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:30] T281981: Special:RecentChanges with userExpLevel=newcomer causes Fatal exception of type "Wikimedia\Rdbms\DBQueryError": Unknown column 'actor_user' - https://phabricator.wikimedia.org/T281981 [19:16:03] !log disable puppet: rolling out change (685485) which affects all hosts [19:16:05] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.4/tests/phpunit/includes: Backport: [[gerrit:685480|Fix order of joins in SpecialRecentChanges (T281981)]] (duration: 01m 10s) [19:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:50] !log ignore the last log message will wait for deploy to finish [19:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:24] 10SRE, 10Wikimedia-Mailing-lists, 
10Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (10Legoktm) I filed https://gitlab.com/mailman/mailman/-/issues/890 [19:17:32] (03PS1) 10Brennen Bearnes: group1 wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685561 [19:17:34] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685561 (owner: 10Brennen Bearnes) [19:17:43] (03PS1) 10Herron: udplog: repoint CNAME to new hosts mwlog[12]002 [dns] - 10https://gerrit.wikimedia.org/r/685562 (https://phabricator.wikimedia.org/T224565) [19:18:30] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685561 (owner: 10Brennen Bearnes) [19:19:59] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.4 [19:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:07] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.4 (duration: 01m 07s) [19:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:28] (03PS2) 10Herron: logstash101[012]: prep for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) [19:22:41] (03PS1) 10Zabe: Use ptwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685563 (https://phabricator.wikimedia.org/T281925) [19:26:38] (03CR) 10Volans: "I did test this on cumin2001, because unable to test it on cumin2002 right now because of missing base bullseye image." 
[puppet] - 10https://gerrit.wikimedia.org/r/685537 (https://phabricator.wikimedia.org/T276589) (owner: 10Volans) [19:30:35] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [19:36:40] (03CR) 10Herron: [C: 03+2] add mwlog[12]002 to profile::dumps::rsync_internal_clients [puppet] - 10https://gerrit.wikimedia.org/r/676997 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [19:36:46] (03PS2) 10Herron: add mwlog[12]002 to profile::dumps::rsync_internal_clients [puppet] - 10https://gerrit.wikimedia.org/r/676997 (https://phabricator.wikimedia.org/T224565) [19:40:58] (03PS2) 10Herron: point wikimania scholarships to mwlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/676995 (https://phabricator.wikimedia.org/T224565) [19:44:42] (03CR) 10Herron: [C: 03+2] point wikimania scholarships to mwlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/676995 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [19:51:13] (03PS2) 10Herron: udplog: repoint CNAME to new hosts mwlog[12]002 [dns] - 10https://gerrit.wikimedia.org/r/685562 (https://phabricator.wikimedia.org/T224565) [19:52:56] brennen: is the train complete? [19:53:00] (03PS1) 10Ladsgroup: mailman3: Change owner email from root@ to listadmins-owner [puppet] - 10https://gerrit.wikimedia.org/r/685567 [19:53:16] jbond42: yeah, looks stable on group1. 
[19:53:38] ok great thanks, my change is fairly minor but didn't want to add noise [19:53:49] !log disable puppet: rolling out change (685485) which affects all hosts [19:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:21] (03CR) 10Herron: [C: 03+2] udplog: repoint CNAME to new hosts mwlog[12]002 [dns] - 10https://gerrit.wikimedia.org/r/685562 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [19:56:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::client: fix chain an chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/685485 (owner: 10Jbond) [19:59:19] !log re-enable puppet post 685485 [19:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T2000). [20:07:28] (03PS1) 10Jbond: P:tlsproxy::envoy: use chained path wit cfssl [puppet] - 10https://gerrit.wikimedia.org/r/685568 [20:08:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29419/console" [puppet] - 10https://gerrit.wikimedia.org/r/685568 (owner: 10Jbond) [20:09:53] (03PS2) 10Jbond: P:tlsproxy::envoy: use chained path wit cfssl [puppet] - 10https://gerrit.wikimedia.org/r/685568 [20:10:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29420/console" [puppet] - 10https://gerrit.wikimedia.org/r/685568 (owner: 10Jbond) [20:11:35] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:tlsproxy::envoy: use chained path wit cfssl [puppet] - 10https://gerrit.wikimedia.org/r/685568 (owner: 10Jbond) [20:12:00]
(03PS1) 10Ssingh: aptrepo: add a component for knot-dnsutils [puppet] - 10https://gerrit.wikimedia.org/r/685571 (https://phabricator.wikimedia.org/T252132) [20:12:03] tries that mgmt interface [20:21:49] (03PS1) 10Jbond: O:debmonioter: Switch back to sslcert ssl provider [puppet] - 10https://gerrit.wikimedia.org/r/685572 (https://phabricator.wikimedia.org/T281673) [20:22:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29421/console" [puppet] - 10https://gerrit.wikimedia.org/r/685572 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [20:24:03] (03PS2) 10Jbond: O:debmonioter: Switch back to sslcert ssl provider [puppet] - 10https://gerrit.wikimedia.org/r/685572 (https://phabricator.wikimedia.org/T281673) [20:25:32] (03PS3) 10Jbond: O:debmonioter: Switch back to sslcert ssl provider [puppet] - 10https://gerrit.wikimedia.org/r/685572 (https://phabricator.wikimedia.org/T281673) [20:26:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29423/console" [puppet] - 10https://gerrit.wikimedia.org/r/685572 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [20:27:31] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:debmonioter: Switch back to sslcert ssl provider [puppet] - 10https://gerrit.wikimedia.org/r/685572 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [20:32:18] (03PS3) 10Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [20:32:55] (03PS4) 10Jbond: hiera - cp1077: test CA bundle with pki and puppet ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [20:33:55] (03CR) 10Herron: [C: 03+2] deploy logster_alarm to mwlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/676996 (https://phabricator.wikimedia.org/T224565) 
(owner: 10Herron) [20:35:28] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10RobH) [20:37:29] (03PS1) 10Urbanecm: UserIdentityValue: Introduce convenience static factory methods [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685587 (https://phabricator.wikimedia.org/T281972) [20:37:31] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frdev1002 - https://phabricator.wikimedia.org/T282054 (10RobH) [20:37:59] (03PS1) 10Urbanecm: UserIdentityValue: Introduce convenience static factory methods [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685588 (https://phabricator.wikimedia.org/T281972) [20:38:08] (03PS1) 10Urbanecm: Cross-wiki block should pass correct wiki blocker [extensions/CentralAuth] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685589 (https://phabricator.wikimedia.org/T277687) [20:38:23] (03PS1) 10Urbanecm: Cross-wiki block should pass correct wiki blocker [extensions/CentralAuth] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685590 (https://phabricator.wikimedia.org/T277687) [20:38:38] (03PS5) 10Jbond: hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [20:39:01] (03PS4) 10Jbond: P:traffic::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [20:39:17] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10herron) Along with migrating these hosts to buster I've deployed an updated config to make mwlog more of a multi-datacenter service. Logs that arrive on (e.g. mwlog1002:8420) ar... 
[20:39:48] jouncebot: now [20:39:48] For the next 0 hour(s) and 20 minute(s): MediaWiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T1900) [20:39:49] For the next 1 hour(s) and 20 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T2000) [20:39:50] (03PS5) 10Jbond: P:traffic::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [20:41:01] (03CR) 10Urbanecm: [C: 03+2] UserIdentityValue: Introduce convenience static factory methods [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685588 (https://phabricator.wikimedia.org/T281972) (owner: 10Urbanecm) [20:41:05] (03CR) 10Urbanecm: [C: 03+2] UserIdentityValue: Introduce convenience static factory methods [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685587 (https://phabricator.wikimedia.org/T281972) (owner: 10Urbanecm) [20:41:11] (03CR) 10Urbanecm: [C: 03+2] Cross-wiki block should pass correct wiki blocker [extensions/CentralAuth] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685589 (https://phabricator.wikimedia.org/T277687) (owner: 10Urbanecm) [20:41:13] (03CR) 10Urbanecm: [C: 03+2] Cross-wiki block should pass correct wiki blocker [extensions/CentralAuth] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685590 (https://phabricator.wikimedia.org/T277687) (owner: 10Urbanecm) [20:41:52] (03PS6) 10Jbond: P:trafficserver::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [20:42:00] (03PS3) 10Jbond: trafficserver: Fix cacert_(dirpath|filename) usage [puppet] - 10https://gerrit.wikimedia.org/r/685503 (https://phabricator.wikimedia.org/T281673) (owner: 10Vgutierrez) [20:42:11] (03PS3) 10Jbond: trafficserver: Clear outbound TLS cacert_path for cp4026 and cp4032 [puppet] - 
10https://gerrit.wikimedia.org/r/685504 (https://phabricator.wikimedia.org/T281673) (owner: 10Vgutierrez) [20:42:23] (03PS4) 10Jbond: P:trafficserver::backend: Use a trusted CA file outside of /etc/ssl/certs [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [20:42:33] (03PS6) 10Jbond: hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) [20:42:41] (03PS7) 10Jbond: P:trafficserver::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [20:45:28] (03PS1) 10Jbond: O:debmonitor::server: Switch debmonitor.wikimedia.org ssl to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/685576 (https://phabricator.wikimedia.org/T281673) [20:46:01] 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10RobH) [20:46:30] 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10RobH) [20:49:14] (03CR) 10Dzahn: "@Aklapper, in this case just merged and letting it happen. You don't need an immediate run because you have SQL result anyways, right? 
don" [puppet] - 10https://gerrit.wikimedia.org/r/682124 (owner: 10Aklapper) [20:55:15] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Ladsgroup) p:05Triage→03Medium [20:55:29] 10SRE, 10Wikimedia-Mailing-lists: Make customized Mailman3 templates translatable - https://phabricator.wikimedia.org/T282018 (10Ladsgroup) p:05Triage→03Medium [20:56:06] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman3 "New subscription request to" template line wraps, breaking long links - https://phabricator.wikimedia.org/T282044 (10Ladsgroup) p:05Triage→03Medium Borderline High. Feel free to change. [20:56:14] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987 (10Ladsgroup) p:05Triage→03Medium [20:58:31] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:07:15] (03Restored) 10Dzahn: admin: update email address for shell user Alangi Derick [puppet] - 10https://gerrit.wikimedia.org/r/685189 (https://phabricator.wikimedia.org/T281564) (owner: 10Dzahn) [21:08:07] (03Merged) 10jenkins-bot: UserIdentityValue: Introduce convenience static factory methods [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/685588 (https://phabricator.wikimedia.org/T281972) (owner: 10Urbanecm) [21:08:13] (03Merged) 10jenkins-bot: UserIdentityValue: Introduce convenience static factory methods [core] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/685587 (https://phabricator.wikimedia.org/T281972) (owner: 10Urbanecm) [21:08:16] (03Merged) 10jenkins-bot: Cross-wiki block should pass correct wiki blocker [extensions/CentralAuth] (wmf/1.37.0-wmf.3) - 
https://gerrit.wikimedia.org/r/685589 (https://phabricator.wikimedia.org/T277687) (owner: Urbanecm)
[21:08:18] (Merged) jenkins-bot: Cross-wiki block should pass correct wiki blocker [extensions/CentralAuth] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685590 (https://phabricator.wikimedia.org/T277687) (owner: Urbanecm)
[21:08:31] (CR) Ssingh: [C: +1] "+1, email matches what was requested on the task." [puppet] - https://gerrit.wikimedia.org/r/685189 (https://phabricator.wikimedia.org/T281564) (owner: Dzahn)
[21:10:08] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (Dzahn) @xSavitar Could you take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/685189 ? That is updating the email address associated with the ex...
[21:10:59] (CR) Awight: "I'm confused—doesn't this patch *promote* Reference Previews to a full-default feature on labs? The comment and commit message make it so" [mediawiki-config] - https://gerrit.wikimedia.org/r/685451 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[21:13:13] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:13:37] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (xSavitar) @Dzahn, I think I should just use the **xsavitar.wiki@aol.com** one as it's what I'm using here on Gerrit & on Wikitech too. Sorry about the confusion th...
[21:13:54] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (xSavitar)
[21:14:52] Pchelolo: it didn't help :/.
[21:15:17] [f1250ea2-917f-40c0-a14e-a74752bcf02c] 2021-05-05 21:15:10: Fatal exception of type "InvalidArgumentException" is what i get.
[21:15:43] aka DB connection domain 'banwiki' does not match 'metawiki'
[21:15:47] (PS1) Ladsgroup: prometheus: Migrate node_puppet_agent cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/685581 (https://phabricator.wikimedia.org/T273673)
[21:15:50] (PS1) Ladsgroup: prometheus: Migrate node_gdnsd cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673)
[21:15:52] (PS1) Ladsgroup: prometheus: Migrate node_file_count cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/685583 (https://phabricator.wikimedia.org/T273673)
[21:16:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:17:03] (CR) jerkins-bot: [V: -1] prometheus: Migrate node_gdnsd cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[21:20:21] (PS2) Ladsgroup: prometheus: Migrate node_gdnsd cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673)
[21:21:49] (PS2) Ladsgroup: prometheus: Migrate node_file_count cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/685583 (https://phabricator.wikimedia.org/T273673)
[21:21:49] (CR) Ssingh: [C: +1] "+1, uid matches and the email address was updated in https://phabricator.wikimedia.org/T281564#7064017."
[puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564) (owner: Dzahn)
[21:21:49] (CR) Awight: Enable ReferencePreviews on first wikis CommonSettings (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[21:24:08] (Abandoned) Dzahn: admin: update email address for shell user Alangi Derick [puppet] - https://gerrit.wikimedia.org/r/685189 (https://phabricator.wikimedia.org/T281564) (owner: Dzahn)
[21:27:35] (CR) Legoktm: [C: +1] "Will deploy when you're done with today's migrations." [puppet] - https://gerrit.wikimedia.org/r/685567 (owner: Ladsgroup)
[21:28:12] (PS3) Dzahn: admin: upgrade derick from ldap_only to deployer [puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564)
[21:29:08] !log urbanecm@deploy1002 sync-file aborted: 8ffb52d5cad9e003696200b9cd3e957ab26bc868: UserIdentityValue: Introduce convenience static factory methods (T281972) (duration: 00m 04s)
[21:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:25] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/includes/user/UserIdentityValue.php: 8ffb52d5cad9e003696200b9cd3e957ab26bc868: UserIdentityValue: Introduce convenience static factory methods (T281972) (duration: 01m 11s)
[21:30:26] (CR) Ladsgroup: "Yes, my ideal solution would be to have an alias for this.
Something like mailman-owner@ and then it redirects to people set in private rep" [puppet] - https://gerrit.wikimedia.org/r/685567 (owner: Ladsgroup)
[21:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:01] (PS2) Awight: Enable ReferencePreviews as full default on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/685554 (https://phabricator.wikimedia.org/T271206) (owner: WMDE-Fisch)
[21:32:13] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.4/includes/user/UserIdentityValue.php: f189c4627cfc692fb743160030a5e5ab92df1485: UserIdentityValue: Introduce convenience static factory methods (T281972) (duration: 01m 09s)
[21:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:21] (CR) D3r1ck01: [C: +1] "LGTM! All information looks correct! 💥" [puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564) (owner: Dzahn)
[21:33:05] (CR) Volans: [C: +1] "Great! Ship it!" [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: CRusnov)
[21:33:34] (CR) Dzahn: [C: +1] "> Yes, my ideal solution would be to have an alias for this.
Something like mailman-owner@ and then it redirects to people set in private r" [puppet] - https://gerrit.wikimedia.org/r/685567 (owner: Ladsgroup)
[21:34:12] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/CentralAuth/includes/CentralAuthUser.php: 6526884848d0bb88c83cec2c6b39461542e21ef6: Cross-wiki block should pass correct wiki blocker (T281972) (duration: 01m 08s)
[21:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:28] (CR) Dzahn: [C: +2] admin: upgrade derick from ldap_only to deployer [puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564) (owner: Dzahn)
[21:35:33] (PS4) Dzahn: admin: upgrade derick from ldap_only to deployer [puppet] - https://gerrit.wikimedia.org/r/685190 (https://phabricator.wikimedia.org/T281564)
[21:37:28] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.4/extensions/CentralAuth/includes/CentralAuthUser.php: 52b134ed84c1c8ef5fcd6927f03567879553d31c: Cross-wiki block should pass correct wiki blocker (T281972) (duration: 01m 09s)
[21:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:52] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (Dzahn)
[21:44:18] SRE, ops-codfw, Discovery, Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (RKemper) a:elukey→RKemper
[21:45:43] !log mailing lists: approved Alangi Derick's pending request for membership in ops mailing list (is becoming deployer) T281309
[21:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:52] T281309: Deployment training request for **xSavitar** - https://phabricator.wikimedia.org/T281309
[21:52:29] RECOVERY - dump of es5 in codfw on alert1001 is OK: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on
2021-05-05 09:42:44 (1722 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:00:56] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (Jclark-ctr) moss-be1001 B4 U30 port13 id5345 moss-be1002 C2 U21 port33 id5344
[22:00:58] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[22:05:05] !log pushing puppet run on all bastion hosts
[22:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:39] !log welcome new deployer derick - user created on deploy1002 and bastions (T281564)
[22:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:23] (PS1) DannyS712: Fix centering of as-of label [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685595
[22:17:31] SRE, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T280668 (Jclark-ctr) found drive from decom`d server. Host is no longer showing any failed drive
[22:24:42] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (Dzahn) Open→Resolved We talked on IRC and I confirm Derick could successfully connect to deploy1002 via a bastion. Resolving!
[22:25:57] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 184161520 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:28:35] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2844064 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:46:33] RECOVERY - dump of es5 in eqiad on alert1001 is OK: Last dump for es5 at eqiad (es1025.eqiad.wmnet) taken on 2021-05-05 09:38:38 (1722 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210505T2300).
[23:00:04] DannyS712: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:08] here
[23:00:23] I can deploy today.
[23:00:49] SRE: Gaining access to MaxMind account associated with noc@wikimedia.org - https://phabricator.wikimedia.org/T282066 (odimitrijevic)
[23:01:01] (CR) Urbanecm: [C: +2] Fix centering of as-of label [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685595 (owner: DannyS712)
[23:05:22] (Merged) jenkins-bot: Fix centering of as-of label [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685595 (owner: DannyS712)
[23:07:49] DannyS712: can you test it via mwdebug1001, please?
[23:08:33] tested, works!
[23:08:36] great
[23:09:01] syncing
[23:10:50] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.4/extensions/GlobalWatchlist/modules/SpecialGlobalWatchlist.display.css: 4947241f876234aabc578409c3691fb791c8f715: Fix centering of as-of label (duration: 01m 08s)
[23:10:51] RECOVERY - dump of es4 in codfw on alert1001 is OK: Last dump for es4 at codfw (es2022.codfw.wmnet) taken on 2021-05-05 09:42:44 (1744 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:02] DannyS712: done
[23:11:04] anything else?
[23:11:15] nope :)
[23:12:04] thanks for the help
[23:12:35] any time
[23:18:46] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/685581 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[23:19:11] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/685582 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[23:20:35] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/685583 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[23:27:01] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100%
[23:27:37] ^ that will be me doing some upgrade on the node
[23:33:25] RECOVERY - Host wdqs2007 is UP: PING OK - Packet loss = 0%, RTA = 31.73 ms
[23:34:31] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper) `elastic2043` seems to have PSU problems, which caused it to randomly reboot: ` racadm>>racadm getsel Record: 1 Date/Time: 04...
[23:34:41] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (RKemper) Resolved→Open
[23:35:18] !log T281621 T281327 [Elastic] Banned `elastic2033` and `elastic2043` from the Cirrussearch Elasticsearch clusters
[23:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:28] T281621: elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621
[23:35:28] T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327
[23:39:37] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:42] ACKNOWLEDGEMENT - MD RAID on wdqs2007 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T282068 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:41:46] SRE, ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T282068 (ops-monitoring-bot)
[23:44:43] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (Papaul) Before BIOS: 2.4.8 IDRAC: 4.00 After BIOS: 2.9.3 IDRAC: 4.22
[23:49:29] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (Papaul) Open→Resolved @RKemper firmware upgrade complete on the host. Resolving the task for now if we still have the same iss...
[23:50:05] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:53:06] (PS1) Ladsgroup: lists: Rename Rename mailing lists eliso, and eliso-anoncoj [puppet] - https://gerrit.wikimedia.org/r/685618 (https://phabricator.wikimedia.org/T281686)
[23:59:15] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[23:59:58] (CR) Legoktm: [C: +2] lists: Rename Rename mailing lists eliso, and eliso-anoncoj [puppet] - https://gerrit.wikimedia.org/r/685618 (https://phabricator.wikimedia.org/T281686) (owner: Ladsgroup)