[00:00:16] PROBLEM - mediawiki-installation DSH group on mw2158 is CRITICAL: Host mw2158 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:01:21] ^ that's me [00:01:25] jouncebot: next [00:01:26] In 82 hour(s) and 28 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1030) [00:02:20] PROBLEM - mediawiki-installation DSH group on mw2172 is CRITICAL: Host mw2172 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:02:46] PROBLEM - mediawiki-installation DSH group on mw2168 is CRITICAL: Host mw2168 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:04:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:57] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10ops-monitoring-bot) Icinga downtime for 12:00:00 set by dzahn@cumin1001 on 15 host(s) and thei... 
[00:08:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw217[1-2].codfw.wmnet [00:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw216[0-9].codfw.wmnet [00:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:54] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw215[8-9].codfw.wmnet [00:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:32] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) >>! In T247018#5966345, @Dzahn wrote: > mw2158 through mw2172 are permanently depooled... [00:16:32] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 144.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:18:56] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 176.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:20:30] !log reload prometheus@ops on prometheus1003 [00:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:16] that check seems new and has no runbook link or explanation what the issue actually is? 
[00:24:18] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 130.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:27:06] ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 179 gt 100 daniel_zahn https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:27:06] ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 210.5 gt 100 daniel_zahn https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:27:06] ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 134.2 gt 100 daniel_zahn https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:27:06] ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 107.8 gt 100 daniel_zahn https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:30:15] 10Operations, 10ops-eqiad, 
10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Dzahn) 05Resolved→03Open [00:30:22] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10colewhite) It appears a reload does resolve the issue, but it takes some time for Prometheus to fetch and store an update. I used `kill -HUP ` to reload. The lo... [00:31:39] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Dzahn) There are a lot of Icinga alerts about all the things on htmldumper1001 for some reason: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host... [00:32:18] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Dzahn) I suspected it just needs restart of nagios-nrpe-server but i also can't SSH to it. I get asked for password. 
[00:33:04] over 40 CRITs again [00:33:10] ACKNOWLEDGEMENT - Check size of conntrack table on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [00:33:10] ACKNOWLEDGEMENT - Check systemd state on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:10] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/NTP [00:33:10] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:33:10] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Microcode [00:33:10] ACKNOWLEDGEMENT - DPKG on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:33:10] ACKNOWLEDGEMENT - Disk space on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=htmldumper1001&var-datasource=eqiad+prometheus/ops [00:33:11] ACKNOWLEDGEMENT - IPMI Sensor Status on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [00:33:11] ACKNOWLEDGEMENT - Long running screen/tmux on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [00:33:12] ACKNOWLEDGEMENT - MD RAID on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:33:12] ACKNOWLEDGEMENT - configured eth on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:33:13] ACKNOWLEDGEMENT - dhclient process on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:33:13] ACKNOWLEDGEMENT - puppet last run on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer daniel_zahn https://phabricator.wikimedia.org/T245567 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:36:08] we won't see actual signal in the noise if we keep having dozens of 
unhandled things at all times. please downtime or ACK [00:38:02] 10Operations, 10Traffic, 10serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (10CDanis) [00:38:09] 10Operations, 10Traffic, 10serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (10CDanis) p:05Triage→03Low [00:41:50] ACKNOWLEDGEMENT - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T247561 [00:49:44] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Papaul) [00:53:28] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Papaul) [01:01:48] RECOVERY - mediawiki-installation DSH group on mw2158 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:03:45] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) Nice catch in the logs! I'm guessing we need to increase some of the inotify tunables. Probably this one: ` ✔️ cdanis@prometheus1003.eqiad.wmnet ~ 🕣🍺 cat /p... [01:03:54] RECOVERY - mediawiki-installation DSH group on mw2172 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:04:20] RECOVERY - mediawiki-installation DSH group on mw2168 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:13:29] (03CR) 10Jforrester: "This isn't the right way to change interwikis, sorry." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577740 (https://phabricator.wikimedia.org/T227053) (owner: 10Fomafix) [01:26:25] (03PS1) 10Papaul: DNS: Remove mgmt asset tag for lvs200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/579451 [01:28:12] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt asset tag for lvs200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/579451 (owner: 10Papaul) [01:29:40] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Papaul) [01:29:57] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Papaul) 05Open→03Resolved Complete [01:31:05] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Papaul) [01:31:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Papaul) 05Open→03Resolved Complete [01:31:40] RECOVERY - Long running screen/tmux on netbox1001 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [01:32:05] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10Papaul) [01:32:17] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10Papaul) 05Open→03Resolved Complete [01:32:42] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Papaul) [01:32:52] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Papaul) 05Open→03Resolved Complete [01:33:25] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Papaul) [01:33:41] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Papaul) 05Open→03Resolved Complete [01:34:14] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Papaul) [01:34:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Papaul) 05Open→03Resolved Complete [01:40:52] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [01:41:23] 10Operations, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10Papaul) ` [edit interfaces 
interface-range disabled] member "ge-[0-1]/0/13" { ... } + member "ge-[0-1]/0/2"; [edit interfaces interface-range vlan-payments]... [01:42:07] 10Operations, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10Papaul) [01:43:04] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [01:44:14] 10Operations, 10DC-Ops, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10Papaul) ` edit interfaces interface-range disabled] member "ge-[0-1]/0/2" { ... } + member "ge-[0-1]/0/4"; [edit interfaces interface-range vlan-payments] -... [01:44:53] 10Operations, 10DC-Ops, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10Papaul) [01:45:08] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [01:52:21] (03PS1) 10Papaul: DNS Remove mgmt DNS for frpig2001 old and fix asset tag for new frpig2001 [dns] - 10https://gerrit.wikimedia.org/r/579452 [01:53:51] (03CR) 10Papaul: [C: 03+2] DNS Remove mgmt DNS for frpig2001 old and fix asset tag for new frpig2001 [dns] - 10https://gerrit.wikimedia.org/r/579452 (owner: 10Papaul) [01:55:25] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6147 (old frpig2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246824 (10Papaul) 05Open→03Resolved Complete 
[02:21:20] (03PS2) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [02:30:30] (03PS3) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [02:35:15] (03PS4) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [02:38:41] (03PS1) 10CDanis: prom@ops: increase inotify watches in core DCs [puppet] - 10https://gerrit.wikimedia.org/r/579453 (https://phabricator.wikimedia.org/T246860) [02:42:32] (03CR) 10CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/21416/" [puppet] - 10https://gerrit.wikimedia.org/r/579453 (https://phabricator.wikimedia.org/T246860) (owner: 10CDanis) [02:54:40] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp4024 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2019-rsa-unified.ocsp is more than 259500 secs old! https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [03:12:08] (03PS5) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [03:12:39] (03PS1) 10Holger Knust: changeprop: Add readiness and liveness check delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/579458 (https://phabricator.wikimedia.org/T213193) [03:15:46] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp2014 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2019-rsa-unified.ocsp is more than 259500 secs old! 
https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [03:24:08] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp2018 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2019-rsa-unified.ocsp is more than 259500 secs old! https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [03:24:18] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [03:26:40] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [03:29:59] (03PS6) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [03:31:33] 10Operations, 10Traffic: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10CDanis) [03:31:55] (03PS1) 10CDanis: switch esams & eqsin to lets-encrypt; globalsign OCSP unhappy [puppet] - 10https://gerrit.wikimedia.org/r/579459 (https://phabricator.wikimedia.org/T247584) [03:35:58] (03PS7) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [03:39:29] 10Operations, 10Traffic, 10Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10CDanis) Sample error log: ` Mar 12 05:42:01 cp3050 CRON[9853]: (root) CMD (/usr/local/sbin/update-ocsp-all 2>&1 | 
logger -t update-ocsp-all) [...] Mar 12 05:42:46... [03:45:40] (03PS1) 10Herron: assign codfw logstash ssd hosts role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/579461 (https://phabricator.wikimedia.org/T234854) [03:49:00] (03CR) 10Herron: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1003/21422/" [puppet] - 10https://gerrit.wikimedia.org/r/579461 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [03:51:36] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp4028 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2019-ecdsa-unified.ocsp is more than 259500 secs old! https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [03:56:12] (03PS1) 10KartikMistry: apertium-br-fr: Fix FTBFS with new apertium [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/579463 [03:56:46] (03PS8) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [03:58:38] (03CR) 10CDanis: "AFAICT, PCC LG: https://puppet-compiler.wmflabs.org/compiler1002/21420/" [puppet] - 10https://gerrit.wikimedia.org/r/579459 (https://phabricator.wikimedia.org/T247584) (owner: 10CDanis) [04:10:01] (03PS2) 10KartikMistry: apertium-br-fr: Fix FTBFS with new apertium [debs/contenttranslation/apertium-br-fr] - 10https://gerrit.wikimedia.org/r/579463 (https://phabricator.wikimedia.org/T247585) [04:15:23] (03CR) 10Herron: [C: 04-2] "PCC is looking good at this point https://puppet-compiler.wmflabs.org/compiler1001/21426/" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [04:41:34] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [04:49:32] 10Operations, 10Traffic, 10Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10CDanis) p:05Triage→03High [04:51:47] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [06:03:47] 10Operations, 10Traffic, 10Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10Vgutierrez) This seems to be triggered by the outage reported by globalsign in https://www.globalsign.com/en/status: `Updated 12 March 2020, 5:25 pm EDT We are st... [06:05:45] !log triggering OCSP response updates in esams - T247584 [06:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:52] T247584: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 [06:12:45] !log triggering OCSP response updates in eqsin - T247584 [06:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:50] T247584: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 [06:16:17] !log triggering OCSP response updates in eqiad,codfw and ulsfo - T247584 [06:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:07] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS- on cp4024 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [06:17:31] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS- on cp4028 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [06:18:23] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS- on cp2014 is 
OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [06:19:27] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS- on cp2018 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [06:20:41] 10Operations, 10Traffic, 10Patch-For-Review: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [06:21:14] (03Abandoned) 10Vgutierrez: switch esams & eqsin to lets-encrypt; globalsign OCSP unhappy [puppet] - 10https://gerrit.wikimedia.org/r/579459 (https://phabricator.wikimedia.org/T247584) (owner: 10CDanis) [06:25:13] (03CR) 10Guozr.im: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [06:55:54] (03CR) 10Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [07:13:42] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) https://librenms.wikimedia.org/graphs/to=1584082800/id=12085 https://librenms.wikimedia.org/device/device=149/tab=port/port=12086/ stat1005 and kafka-jumbo1006 are in the sa... [07:49:30] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) I did some tests and the two hosts are definitely related. I logged as root on both via mgmt console and turned off their interfaces, and the stat1005's broadcast traffic wen... 
[07:57:39] (03PS1) 10KartikMistry: apertium-cat-ita: Fix FTBFS and update to 0.2.1 release [debs/contenttranslation/apertium-ca-it] - 10https://gerrit.wikimedia.org/r/579509 (https://phabricator.wikimedia.org/T247585) [08:01:39] (03CR) 10jerkins-bot: [V: 04-1] apertium-cat-ita: Fix FTBFS and update to 0.2.1 release [debs/contenttranslation/apertium-ca-it] - 10https://gerrit.wikimedia.org/r/579509 (https://phabricator.wikimedia.org/T247585) (owner: 10KartikMistry) [08:05:07] (03PS2) 10KartikMistry: apertium-cat-ita: Fix FTBFS and update to 0.2.1 release [debs/contenttranslation/apertium-ca-it] - 10https://gerrit.wikimedia.org/r/579509 (https://phabricator.wikimedia.org/T247585) [08:24:08] (03PS2) 10Muehlenhoff: Remove obsolete kafka[12]00[1-3] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579287 [08:28:52] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete kafka[12]00[1-3] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579287 (owner: 10Muehlenhoff) [08:36:23] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:36:37] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:37:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:37:58] a link eqiad <-> codfw seems to have gone down [08:38:01] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) The other host in D1, the rack of stat1005 and jumbo1006 is kafka-jumbo1008, one of the new ones: https://netbox.wikimedia.org/dcim/devices/2510/ [08:38:08] s/link/transit/ [08:39:49] (03PS1) 10Muehlenhoff: Set kafka-main[12]00[4-5]-main as role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/579512 [08:40:58] (03PS3) 10Ema: atskafka: add puppet 
module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [08:41:00] (03PS1) 10Ema: cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) [08:41:02] (03PS1) 10Ema: cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) [08:41:29] jynus: seems to be the Telia link, from what I can see BFD reports the session down only on the eqiad side [08:41:49] (03Abandoned) 10Ema: Define a test pipeline with Blubber [software/atskafka] - 10https://gerrit.wikimedia.org/r/579283 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [08:41:50] strange, it said maintenance was earlier in the day [08:42:34] Start Date and Time: 2020-Mar-13 06:00 UTC [08:42:35] End Date and Time: 2020-Mar-13 10:00 UTC [08:42:48] for the link that shows problems [08:45:18] then whoever added it to maintenance did it wrong [08:45:20] :-D [08:46:18] it was only added 6-8 [08:46:37] no worries, then [08:46:46] should come up soon [08:50:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [08:51:04] _joe_: good morning ;) I could use gcc / libc-dev / libssl-dev to be added to the python-build containers. That is to build pycrypto, a requirement of Zuul ( https://gerrit.wikimedia.org/r/#/c/579290/ ) ;) [08:51:17] _joe_: or maybe there is another way [08:51:52] (03PS3) 10Elukey: admin: simplify and document some analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) [08:52:02] <_joe_> hashar: yes, add them in a build container [08:53:13] _joe_: do you mean an intermediate container that is based on the python-build one? If so do you have an example? 
;) [08:53:32] <_joe_> hashar: multi stage builds [08:54:10] (03CR) 10Elukey: "> JFTR; in modules/openldap/files/cross-validate-accounts.py we have" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [08:55:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:56:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/579373 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [08:56:57] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:56:59] _joe_: eeeek. We went copy pasting what has been done for debmonitor/deploy , so merely reused python-build [08:57:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:57:19] now I see operations/software/homer has some lib dependencies. I will follow that pattern [08:57:19] ;) [08:58:02] <_joe_> hashar: oh I see [08:58:12] <_joe_> sorry I misunderstood your use-case [08:58:33] that is to convert zuul deployment to be scap v2 based. 
It depends on paramiko < pycrypto [08:59:21] then homer/deploy does use an intermediate docker build to inject the missing libs / gcc etc https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/master/Dockerfile.build [08:59:44] this way we can easily add whatever package we need without having to get the python-build containers rebuilt [09:02:01] (03PS4) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [09:02:03] (03PS2) 10Ema: cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) [09:02:05] (03PS2) 10Ema: cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) [09:03:18] (03CR) 10jerkins-bot: [V: 04-1] cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [09:03:21] _joe_: I would assume it gives more freedom to have the dependencies defined in the software deploy repository rather than in the python3-build-* containers. 
I am going to follow the pattern used for homer [09:04:04] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [09:06:39] (03CR) 10Muehlenhoff: OpenStack/Queens/Stretch, avoid upgrading systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579289 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [09:09:38] (03PS5) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [09:09:40] (03PS3) 10Ema: cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) [09:09:42] (03PS3) 10Ema: cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) [09:12:15] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10MoritzMuehlenhoff) [09:14:16] (03CR) 10Ema: "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1002/21430/" [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [09:14:35] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10MoritzMuehlenhoff) [09:17:56] !log installing perl updates from Stretch point release [09:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:32] 10Operations: apt config on planet1001 would install systemd from backports - https://phabricator.wikimedia.org/T247592 (10MoritzMuehlenhoff) [09:33:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) (owner: 
10Giuseppe Lavagetto) [09:40:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [09:55:14] <_joe_> !log running puppet across appservers to switch to http for eventgate-analytics T247484 [09:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:20] T247484: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 [10:09:06] !log upload trafficserver 8.0.6-1wm3 to apt.wm.o (buster) - T245616 [10:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:12] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [10:17:12] (03CR) 10QEDK: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [10:24:02] (03CR) 10Elukey: [C: 03+1] cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:24:04] (03CR) 10Elukey: [C: 03+1] cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:26:44] !log installing python-werkzeug security updates [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:58] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: l3_agent: relocate depedency on Package resource [puppet] - 10https://gerrit.wikimedia.org/r/579526 [10:32:52] 10Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (10MoritzMuehlenhoff) [10:34:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron: l3_agent: relocate depedency on Package resource [puppet] - 
10https://gerrit.wikimedia.org/r/579526 (owner: 10Arturo Borrero Gonzalez) [10:36:47] (03CR) 10Gehel: "This looks like a NOOP from PCC: https://puppet-compiler.wmflabs.org/compiler1002/21431/" [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [10:51:18] 10Operations, 10Analytics, 10User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (10elukey) 05Open→03Resolved We added jupyterhub to stat1004 and stat1006 and we'll move people with big homes, it should help long term. [11:08:41] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) I checked the last changes happened yesterday on the switch via: ` elukey@asw2-d-eqiad> show system rollback compare 3 0 [edit interfaces interface-range vlan-private1-d-eqi... [11:20:58] (03CR) 10Jbond: "see comment re: lookup_options" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [11:21:53] (03CR) 10Jbond: [C: 03+2] puppet agent: add parameter to change certificate_revocation behaviour [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [11:22:03] (03CR) 10Jbond: [C: 03+2] puppetmaster: add parameter to control ssl_verify_depth [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [11:26:21] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [11:27:49] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10jbond) [11:30:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/579373 (https://phabricator.wikimedia.org/T224576) (owner: 
10Dzahn) [11:51:26] (03PS1) 10Giuseppe Lavagetto: Switch eventgate-analytics back to direct connection in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579531 (https://phabricator.wikimedia.org/T247484) [11:58:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch eventgate-analytics back to direct connection in eqiad (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579531 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [12:01:33] (03PS1) 10Muehlenhoff: sre.cassandra.roll-restart: Add more aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/579533 [12:01:48] (03PS2) 10Arturo Borrero Gonzalez: openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) [12:02:57] (03Abandoned) 10Hashar: python-build: add gcc etc + libssl-dev [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/579290 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [12:03:15] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10akosiaris) I 've had a look as well. I 've checked that the mac address of kafka-jumbo1006 is indeed the one the switch learns and indeed that's true. I 've bounced the port as well... 
[12:06:10] (03CR) 10jerkins-bot: [V: 04-1] openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) (owner: 10Arturo Borrero Gonzalez) [12:07:49] (03PS3) 10Arturo Borrero Gonzalez: openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) [12:09:50] (03PS4) 10Arturo Borrero Gonzalez: openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) [12:09:54] (03PS2) 10Alexandros Kosiaris: Switch eventgate-analytics back to direct connection in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579531 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [12:11:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch eventgate-analytics back to direct connection in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579531 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [12:12:46] (03Merged) 10jenkins-bot: Switch eventgate-analytics back to direct connection in eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579531 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [12:13:51] (03CR) 10Arturo Borrero Gonzalez: "PCC https://puppet-compiler.wmflabs.org/compiler1002/21433/" [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) (owner: 10Arturo Borrero Gonzalez) [12:15:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) (owner: 10Arturo Borrero Gonzalez) [12:15:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] cpjobqueue: Add jobrunner_host & videoscaler_host to deployment vars [puppet] - 
10https://gerrit.wikimedia.org/r/577677 (https://phabricator.wikimedia.org/T246371) (owner: 10Clarakosi) [12:16:00] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 16s) [12:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:19] (03CR) 10Ssingh: [C: 03+2] First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [12:16:39] (03CR) 10Ssingh: [V: 03+2 C: 03+2] First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [12:18:59] (03CR) 10Hnowlan: [C: 03+1] sre.cassandra.roll-restart: Add more aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/579533 (owner: 10Muehlenhoff) [12:21:45] 10Operations, 10Traffic, 10serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (10jcrespo) I've made a sanity check, in addition to the ones cdanis did, and verified that indeed the number of http requests to those dcs do... 
[12:25:51] (03CR) 10Muehlenhoff: [C: 03+2] sre.cassandra.roll-restart: Add more aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/579533 (owner: 10Muehlenhoff) [12:42:03] (03CR) 10Muehlenhoff: [C: 04-1] admin: simplify and document some analytics posix groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [12:48:31] !log Password reset for SUL User:FuduBot (T247601) [12:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:36] T247601: Password reset for User:FuduBot - https://phabricator.wikimedia.org/T247601 [12:50:42] 10Operations, 10Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603 (10Base) [12:52:04] 10Operations, 10Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603 (10Base) If it helps, here are the headers of one of the emails that wasn't delivered: ` Received: from mxback17g.mail.yandex.net (localhost [12... 
[13:11:53] 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh) [13:24:35] 10Operations, 10Traffic, 10Core Platform Team (Icebox), 10Core Platform Team Workboards (Clinic Duty Team): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10AMooney) [13:32:27] 10Operations, 10Traffic, 10Core Platform Team (Icebox): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (10CCicalese_WMF) [13:33:49] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10AMooney) [13:34:08] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10AMooney) [13:43:53] (03CR) 10Herron: [C: 04-2] logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [13:45:20] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10MoritzMuehlenhoff) [13:54:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) [13:56:01] (03CR) 10Elukey: admin: simplify and document some analytics posix groups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [13:57:36] (03PS4) 10Elukey: admin: simplify and document some analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) [13:58:10] (03CR) 10Herron: [C: 
03+1] Set kafka-main[12]00[4-5]-main as role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/579512 (owner: 10Muehlenhoff) [13:58:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM now" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:07:00] (03PS2) 10Andrew Bogott: OpenStack/Queens/Stretch, avoid upgrading systemd [puppet] - 10https://gerrit.wikimedia.org/r/579289 (https://phabricator.wikimedia.org/T247013) [14:11:36] 10Operations: persistent cronspam from Cron Daemon - https://phabricator.wikimedia.org/T247608 (10Jgreen) [14:17:38] (03CR) 10Herron: "Thanks for this!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [14:18:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) [x] fixed BIOS serial port setting to redirect console output after boot [14:23:02] (03PS1) 10Elukey: role::analytics_cluster::hadoop::worker: use G1GC for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/579573 [14:25:25] (03PS2) 10Elukey: role::analytics_cluster::hadoop::worker: use G1GC for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/579573 [14:26:59] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop::worker: use G1GC for Yarn [puppet] - 10https://gerrit.wikimedia.org/r/579573 (owner: 10Elukey) [14:29:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) @Cmjohnson DHCP request appear to be in the incorrect vlan (frack-administration-eqiad), can you please confirm the port settings to make sure both... [14:36:02] (03CR) 10Herron: [C: 03+1] "This was solved with the elk7 ES template. Should we abandon? 
Or go for the belt-and-suspenders approach of preventing the type conflict " [puppet] - 10https://gerrit.wikimedia.org/r/576908 (https://phabricator.wikimedia.org/T239090) (owner: 10Cwhite) [14:36:58] (03CR) 10Herron: [C: 03+1] "Same comment here -- solved with the elk7 ES template. Should we abandon? Or go for the belt-and-suspenders approach of preventing the ty" [puppet] - 10https://gerrit.wikimedia.org/r/576910 (https://phabricator.wikimedia.org/T239458) (owner: 10Cwhite) [14:44:45] 10Operations, 10Traffic: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10Vgutierrez) for future reference, OCSP response update can be triggered like this: ` sudo -i cumin -b1 'A:cp-eqiad' "/usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all" ` [14:49:26] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack/Queens/Stretch, avoid upgrading systemd [puppet] - 10https://gerrit.wikimedia.org/r/579289 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [14:51:43] 10Operations, 10Traffic: tracking task: Globalsign OCSP unhappiness 2020-03-12 - https://phabricator.wikimedia.org/T247584 (10CDanis) Thanks Valentin! 
I also made some edits on wikitech: https://wikitech.wikimedia.org/w/index.php?title=HTTPS%2FUnified_Certificates&type=revision&diff=1860120&oldid=1773414 [14:52:30] (03PS1) 10Jgreen: restore A records for frpig2001.frack.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/579576 [14:55:38] (03CR) 10Jgreen: [C: 03+2] restore A records for frpig2001.frack.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/579576 (owner: 10Jgreen) [14:57:24] !log T247586 ✔️ cdanis@grafana1002.eqiad.wmnet ~ 🕥☕ sudo systemctl restart apache2.service [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:30] T247586: WebPageTest and WebPageReplay data unreachable in Grafana - https://phabricator.wikimedia.org/T247586 [14:59:11] (03PS9) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [15:01:16] (03PS1) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) [15:07:00] (03PS1) 10Giuseppe Lavagetto: Revert "services_proxy: use http endpoint for eventgate-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/579582 [15:10:17] (03PS10) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [15:12:36] jbond42: I'll give that a closer look today but thank you for doing it! huge improvement [15:14:41] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:15:05] rlazarus: awesome thanks, always hard to get reviews on that repo so much appreciated [15:16:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "services_proxy: use http endpoint for eventgate-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/579582 (owner: 10Giuseppe Lavagetto) [15:17:28] ^^ those unmerged changes is a known issue? [15:18:13] vgutierrez: labtestpuppetmaster2001 is probably arturo or jeh working on Neutron stuff? [15:18:29] (03PS2) 10Muehlenhoff: Set kafka-main[12]00[4-5]-main as role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/579512 [15:18:34] please downtime the affected services then :) [15:19:03] I'm not doing any operation on that server [15:19:43] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:20:05] that server is currently unable to git fetch from gerrit.wm.o [15:21:22] <_joe_> vgutierrez: it was a change by andrewbogott [15:21:26] <_joe_> I just merged it [15:21:52] ack [15:22:01] _joe_: thanks! But is that the same thing that vgutierrez was asking about? [15:23:01] <_joe_> andrewbogott: I think that's what triggered that alert [15:23:11] ok [15:23:41] (03CR) 10Muehlenhoff: [C: 03+2] Set kafka-main[12]00[4-5]-main as role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/579512 (owner: 10Muehlenhoff) [15:27:45] (03PS11) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [15:28:35] <_joe_> !log switch envoy logging to debug on mw2231 [15:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:18] (03CR) 10Alexandros Kosiaris: "> @Alex, I was wondering if you had any feedback on Petr's comment. 
Are you OK" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [15:32:20] (03PS1) 10Volans: gen-zones: transliterate commit message to ASCII [dns] - 10https://gerrit.wikimedia.org/r/579586 [15:34:00] 10Operations, 10Puppet, 10User-Joe: Disable hiera autolookups - https://phabricator.wikimedia.org/T181971 (10jbond) [[ https://tickets.puppetlabs.com/browse/PUP-6576| data_binding_terminus ]] is now deprecated. > > > data_binding_terminus > > When automatic class parameter lookup was still young, we incl... [15:35:23] (03CR) 10Volans: "This is the full diff of the generated zonefiles between current master and this patch:" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [15:35:57] (03CR) 10Herron: [C: 04-2] "https://puppet-compiler.wmflabs.org/compiler1003/21437/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [15:39:28] 10Puppet, 10User-jbond: hiera_lookup: Allow query against checkout of labs/private in addition to checkout of operations/puppet - https://phabricator.wikimedia.org/T216647 (10jbond) [15:39:54] 10Puppet, 10User-jbond: hiera_lookup: Allow query against checkout of labs/private in addition to checkout of operations/puppet - https://phabricator.wikimedia.org/T216647 (10jbond) Just came across this by accident, what exactly do you mean. can you provide some examples? [15:40:51] (03CR) 10CRusnov: [C: 03+1] "Hah woops. 
This seems semi-reasonable however we should make sure to fix it properly if possible in the future" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [15:41:18] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-jbond: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10jbond) [15:41:39] (03CR) 10Volans: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [15:44:08] 10Puppet, 10User-jbond: hiera_lookup: Allow query against checkout of labs/private in addition to checkout of operations/puppet - https://phabricator.wikimedia.org/T216647 (10crusnov) Ah so the use case is trying to figure out the visibility-cone of hiera keys from the utils/hiera_lookup. We always place false... [15:44:25] 10Puppet, 10User-jbond: hiera_lookup: Allow query against checkout of labs/private in addition to checkout of operations/puppet - https://phabricator.wikimedia.org/T216647 (10crusnov) p:05Triageβ†’03Lowest [15:47:40] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [15:57:50] (03PS1) 10Hashar: zuul: provision the scap repository [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) [15:58:12] (03CR) 10jerkins-bot: [V: 04-1] zuul: provision the scap repository [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [15:59:43] (03CR) 10Hashar: "CI fails due to the other change." 
[puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [16:01:03] (03CR) 10Mstyles: "> Patch Set 3:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [16:01:29] (03CR) 10Ppchelko: "> So, to recap, it seems like we are forced to go down the 2 chart route. That's fine, as long as we are all aware as to why and what the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [16:02:18] !log powercycle kafka-jumbo1006 after switch port changed - T247561 [16:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:24] T247561: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 [16:04:30] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [16:04:35] !log rebooting labstore1007 for first cycle of upgrades T224583 [16:04:38] RECOVERY - Host kafka-jumbo1006 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [16:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:40] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [16:05:03] (03CR) 10Andrew Bogott: OpenStack/Queens/Stretch, avoid upgrading systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579289 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [16:06:23] ah! [16:06:31] hello kafka-jumbo! 
[16:08:04] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:08:37] 10Puppet, 10User-jbond: hiera_lookup: Allow query against checkout of labs/private in addition to checkout of operations/puppet - https://phabricator.wikimedia.org/T216647 (10jbond) ahh i see, i can have a think about that however its much better to use the puppet lookup command on the puppet masters (this wo... [16:08:41] (03PS1) 10Andrew Bogott: OpenStack/Queens/Stretch, move priority to 501 [puppet] - 10https://gerrit.wikimedia.org/r/579589 (https://phabricator.wikimedia.org/T247013) [16:09:52] (03CR) 10CRusnov: [C: 03+1] "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [16:10:01] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack/Queens/Stretch, move priority to 501 [puppet] - 10https://gerrit.wikimedia.org/r/579589 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [16:12:12] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:12:31] (03CR) 10Hnowlan: changeprop: Add readiness and liveness check delay (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/579458 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [16:13:11] (03CR) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:14:26] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less 
than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:14:35] (03CR) 10Jbond: "> Patch Set 2:" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:15:54] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [16:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:25] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:40] 10Operations, 10Traffic: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (10Vgutierrez) [16:18:55] 10Operations, 10Traffic: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (10Vgutierrez) p:05Triage→03Medium [16:19:45] (03PS1) 10Jbond: data_admin: port to python3 [puppet] - 10https://gerrit.wikimedia.org/r/579591 (https://phabricator.wikimedia.org/T247364) [16:20:03] (03PS1) 10Ssingh: Add support for Travis CI and Codecov [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/579592 [16:20:23] (03PS1) 10CRusnov: Update Netbox to v2.7.10-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/579593 [16:20:28] (03CR) 10Volans: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [16:21:10] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1006 is CRITICAL: 2.36e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [16:21:57] 10Operations, 10Traffic: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (10Vgutierrez) [16:24:11] 
(03CR) 10RLazarus: pick_nodes: add ability to pick nodes based on a puppet class (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:26:31] (03CR) 10Muehlenhoff: [C: 03+1] "The patch is fine, but is this used at all? It doesn't seem to be referenced anywhere in puppet.git?" [puppet] - 10https://gerrit.wikimedia.org/r/579591 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [16:27:53] (03PS2) 10Krinkle: Drop the 'pp_stage0' and 'pp_stage1' dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579083 (owner: 10Jforrester) [16:28:16] (03CR) 10Krinkle: [C: 03+2] Drop the 'pp_stage0' and 'pp_stage1' dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579083 (owner: 10Jforrester) [16:28:21] (03PS2) 10Krinkle: Drop the 'top6-wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579084 (owner: 10Jforrester) [16:28:25] (03CR) 10Krinkle: [C: 03+2] Drop the 'top6-wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579084 (owner: 10Jforrester) [16:28:59] (03CR) 10CRusnov: [C: 03+1] "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/579586 (owner: 10Volans) [16:29:42] (03Merged) 10jenkins-bot: Drop the 'pp_stage0' and 'pp_stage1' dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579083 (owner: 10Jforrester) [16:29:57] 10Operations, 10SRE-tools, 10Traffic, 10Continuous-Integration-Config, and 5 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (10crusnov) [16:30:06] (03Merged) 10jenkins-bot: Drop the 'top6-wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579084 (owner: 10Jforrester) [16:30:44] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/579593 (owner: 10CRusnov) [16:31:59] (03PS4) 10Krinkle: Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 (owner: 10Jforrester) [16:32:09] (03CR) 10Krinkle: [C: 03+1] Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 (owner: 10Jforrester) [16:33:34] 10Operations, 10SRE-tools, 10Traffic, 10Continuous-Integration-Config, and 5 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (10crusnov) Summary of status: We have decided that modifying the CI image itself is unnecessary in light of incidental changes to deplo... [16:34:22] (03Abandoned) 10CRusnov: gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [16:35:51] (03CR) 10Jforrester: [C: 03+1] "Oh, yeah, this should be good to go now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 (owner: 10Jforrester) [16:35:55] (03PS4) 10Mstyles: kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) [16:37:07] !log krinkle@deploy1001 Synchronized wmf-config/config/: If4d17082f, Iadba5b01b (duration: 01m 11s) [16:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:40] (03CR) 10Jforrester: Add wikidata.beta.wmflabs.org to beta csp (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 (owner: 10Brian Wolff) [16:40:04] (03CR) 10Jforrester: "This should probably be split into the IS change and the CS change to be more deploy-safe, but as long as you absolutely definitely sync I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [16:40:08] (03PS3) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) [16:40:29] (03CR) 10Ssingh: [C: 03+2] "Merging without review for now as there is no change in functionality." [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/579592 (owner: 10Ssingh) [16:40:46] (03CR) 10Jbond: "thanks for the quick review" (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:41:15] (03CR) 10Jforrester: "Anything I can do to help move this stack forward?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572928 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [16:42:06] (03PS1) 10Andrew Bogott: OpenStack/Queens/Stretch: fix pinning rule [puppet] - 10https://gerrit.wikimedia.org/r/579597 (https://phabricator.wikimedia.org/T247013) [16:43:38] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack/Queens/Stretch: fix pinning rule [puppet] - 10https://gerrit.wikimedia.org/r/579597 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [16:51:32] !log rebooting labstore1007 for stretch upgrade T224583 [16:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:38] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [16:52:20] (03CR) 10Krinkle: [C: 03+2] "Confirmed no matches outside the repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 (owner: 10Jforrester) [16:52:41] Krinkle: Scattering of "wmf.23 i/s/SiteList.php:155 PHP Notice: Undefined index: gomwiki" etc. [16:53:38] (03Merged) 10jenkins-bot: Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 (owner: 10Jforrester) [16:54:50] (03PS2) 10Hnowlan: changeprop: Add readiness and liveness check delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/579458 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [16:55:32] James_F: where? [16:56:03] https://logstash.wikimedia.org/goto/706a283d9aca84a50e39563a1b9f024f [16:56:03] got it [16:56:23] started 7AM this morning? [16:56:49] looks like a wikibase/SiteList issue, not sure. [16:56:56] Right, unrelated. OK. [16:57:08] I just saw it in trace and thought I should flag. 
[16:57:13] yeah right on [16:57:46] (03CR) 10Herron: [C: 03+1] kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [16:58:49] !log krinkle@deploy1001 Synchronized wmf-config/config/: Ibe16d5f09 (duration: 01m 10s) [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:55] (03CR) 10Hnowlan: [C: 03+2] changeprop: Add readiness and liveness check delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/579458 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [16:59:14] (03Merged) 10jenkins-bot: changeprop: Add readiness and liveness check delay [deployment-charts] - 10https://gerrit.wikimedia.org/r/579458 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [16:59:26] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) labstore1007 is serving NFS and web fine on stretch now. [16:59:43] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) Mind you, clients are still all pointed at 1006. Now to try buster. [17:00:13] !log krinkle@deploy1001 Synchronized dblists/: If4d17082f, Iadba5b01b, Ibe16d5f09 (duration: 01m 07s) [17:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:23] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Cmjohnson) updated the vlan from administration to fundraising [17:03:59] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:02] (03PS3) 10Dzahn: squid: remove obsolete hierarchy_stoplist config directive [puppet] - 10https://gerrit.wikimedia.org/r/579373 (https://phabricator.wikimedia.org/T224576) [17:08:00] (03CR) 10Dzahn: [C: 03+2] squid: remove obsolete hierarchy_stoplist config directive [puppet] - 10https://gerrit.wikimedia.org/r/579373 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:08:05] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [17:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:28] ^ this should stop some cronspam you might have seen from squid [17:09:52] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:29] (03PS5) 10Dzahn: installserver: allow configuring squid as absent in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) [17:15:10] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21391/" [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:17:02] (03CR) 10Dzahn: "this removes the service from the jessie servers. 
DNS CNAME has already been switched 2 days ago and logs are empty" [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:20:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [17:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:12] !log removed squid from install1002/install2002 (formerly webproxy.(eqiad|codfw).wmnet until 2 days ago, replaced by install1003/install2003) T224576 [17:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:17] T224576: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 [17:25:32] PROBLEM - Squid on install2002 is CRITICAL: connect to address 208.80.153.53 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:35:14] (03PS1) 10Thcipriani: Gerrit: apache proxy not pooled [puppet] - 10https://gerrit.wikimedia.org/r/579601 (https://phabricator.wikimedia.org/T246763) [17:45:57] icinga-wm: ^ well.. yea.. i stopped that but puppet should also remove monitoring if i do [17:46:00] will fix that [17:46:39] (03PS1) 10Thcipriani: Integration Cluster: update gitcache nightly [puppet] - 10https://gerrit.wikimedia.org/r/579602 [17:51:30] PROBLEM - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:53:13] ^I am guessing that is outdated monitoring, right? 
[17:54:38] (03PS1) 10Hnowlan: changeprop: new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/579604 (https://phabricator.wikimedia.org/T213193) [17:54:49] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10Papaul) [17:55:14] ACKNOWLEDGEMENT - Squid on install1002 is CRITICAL: connect to address 208.80.154.22 and port 8080: Connection refused daniel_zahn replaced by 1003 https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:55:22] 10Operations, 10LDAP-Access-Requests: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Moushira) Hi @akosiaris, so the new expiry date is Dec 31, 2020. Thanks! [17:55:26] jynus: it's true but it's a bug that it did not get removed together with the package and config [17:55:39] ok, not a worry then [17:55:59] jynus: will make a puppet fix to remove monitoring when service is set to absent in Hiera. and yes, nothing to worry [17:56:06] it is replaced by the same service on 1003 [17:56:17] yep, I would only worry if it was a production impact [17:56:26] (03CR) 10Holger Knust: [C: 03+2] changeprop: new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/579604 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:56:38] not an issue if a followup is needed on a known change [17:56:43] it's not. but thanks for seeing these [17:56:45] (03Merged) 10jenkins-bot: changeprop: new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/579604 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:57:26] ACKNOWLEDGEMENT - Squid on install2002 is CRITICAL: connect to address 208.80.153.53 and port 8080: Connection refused daniel_zahn replaced by 2003 https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:57:41] i want to disable them to proof that nothing uses it anymore. 
on Monday or Tuesday we will switch the other services on install [17:57:54] (DHCP/TFTP/apt/..) [17:58:34] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:25] just one question, I see some other alerts related to kafka lag, do you know if there are some new servers or something? [18:00:49] this looks weird: https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad&refresh=5m&from=1584100840625&to=1584122440625 [18:01:08] jynus: the other day kafka-jumbo1006 suddenly lost networking [18:01:21] and these alerts should be caused by this i think [18:01:33] sure, not as worried about single servers [18:01:37] https://phabricator.wikimedia.org/T247561 [18:01:41] as the general aggregated metrics [18:02:03] ottomata would know more [18:02:03] jynus: I am working on them, it seems related to kafka-jumbo1006 that is now recovering [18:02:07] but I am not super sure [18:02:13] which seem weird in the last 2 hours [18:02:13] ah, that :) [18:02:23] there are 2 servers that went down, the other is stat1005 [18:02:28] it happened only seconds apart [18:02:41] yes yes I am still trying to bring stat1005 up [18:02:42] one is apparently fixed now (switch vlan config) but the other is not? afaik [18:02:48] yeah :( [18:02:57] ok then, as long as there is someone monitoring the general status, no issue [18:03:07] stat1005 is currently stuck booting for (2 of 2) A start job is running for…edia.org (1h 17min 35s / no limit) [18:03:13] didn't want to leave if nobody had noticed that [18:03:18] and I am not sure what the problem is yet [18:03:29] it would "only" impact analytics, and those can be lagged right? 
[18:03:36] ACKNOWLEDGEMENT - Kafka Broker Replica Max Lag on kafka-jumbo1006 is CRITICAL: 2.405e+08 ge 5e+06 daniel_zahn https://phabricator.wikimedia.org/T236327 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [18:03:46] this one or more? [18:04:31] so the only one that worried me was the general one, the one on icinga [18:04:49] because the others are per-server stats, which can be down at times [18:04:53] for maintenance [18:05:38] the mirror maker alert is for analytics yes, mirroring from main to jumbo [18:06:04] what I mean is lag is ok, it is not like on the job queue, that bad things start to happen? [18:06:04] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [18:08:10] See: https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad&refresh=5m&from=1584100840625&to=1584122440625&fullscreen&panelId=5 [18:10:43] jynus: mirror maker is basically a tool to replicate topics from one kafka cluster to the other. In this case, "main" is the cluster of the job queue, and jumbo is analytics. 
We mirror some topics to jumbo in order to process/store them later [18:10:59] the lag in there is the mirror maker consumer on jumbo that is struggling a bit [18:11:28] ok, so I hope things improve over the weekend or when the host gets up again [18:11:50] some of those are already going down [18:11:57] (03CR) 10RLazarus: pick_nodes: add ability to pick nodes based on a puppet class (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [18:12:17] I hope so as well, jumbo1006 is recovering nicely but it will take time since it was down for a lot of hours [18:12:31] (03CR) 10Hashar: "Good idea! However mediawiki/core has bunch of branches that would not be updated by using a simple 'git fetch'. I have step on that a f" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579602 (owner: 10Thcipriani) [18:12:48] (03CR) 10Hashar: "Forgot: I definitely like the idea." [puppet] - 10https://gerrit.wikimedia.org/r/579602 (owner: 10Thcipriani) [18:14:50] jouncebot: now [18:14:50] No deployments scheduled for the next 64 hour(s) and 15 minute(s) [18:15:04] I'm going to fiddle with mwdebug for a bit [18:15:10] mwdebug1002 [18:18:50] the other ongoing stuff is that varnish hit ration seems low since yestarday [18:18:58] *ratio [18:19:49] although not unusually low [18:22:02] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 189.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [18:29:20] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 239.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [18:36:43] (03PS2) 10Guozr.im: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) [18:36:45] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [18:39:21] (03CR) 10Guozr.im: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [18:46:58] (03PS1) 10Andrew Bogott: Horizon policy.json for nova: disable resizing [puppet] - 10https://gerrit.wikimedia.org/r/579606 (https://phabricator.wikimedia.org/T247573) [18:47:40] (03CR) 10Volans: "Didn't we had something in the TODO to add to the scap config about sessions or similar?" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/579593 (owner: 10CRusnov) [18:48:22] (03CR) 10CRusnov: "> Patch Set 1:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/579593 (owner: 10CRusnov) [18:51:22] !log test increase fs.inotify.max_user_watches on prometheus2004 [18:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:08] (03CR) 10Cwhite: [C: 03+2] "Tested manually on prometheus2004. The watcher errors are not present after tweaking this setting." [puppet] - 10https://gerrit.wikimedia.org/r/579453 (https://phabricator.wikimedia.org/T246860) (owner: 10CDanis) [18:55:36] shdubsh: \o/ [18:55:44] shdubsh: are you going to merge or should I? 
[18:55:58] cdanis: merging now [18:56:03] great, thanks [19:01:14] (03PS1) 10MarcoAurelio: Add .gitreview [software/ncmonitor] - 10https://gerrit.wikimedia.org/r/579608 (https://phabricator.wikimedia.org/T247619) [19:01:43] (03PS2) 10MarcoAurelio: Add .gitreview [software/ncmonitor] - 10https://gerrit.wikimedia.org/r/579608 (https://phabricator.wikimedia.org/T247619) [19:01:58] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Add .gitreview [software/ncmonitor] - 10https://gerrit.wikimedia.org/r/579608 (https://phabricator.wikimedia.org/T247619) (owner: 10MarcoAurelio) [19:04:58] (03PS2) 10Andrew Bogott: Horizon policy.json for nova: disable resizing [puppet] - 10https://gerrit.wikimedia.org/r/579606 (https://phabricator.wikimedia.org/T247573) [19:05:00] 10Operations, 10Traffic: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (10MarcoAurelio) [19:05:37] (03CR) 10jerkins-bot: [V: 04-1] Horizon policy.json for nova: disable resizing [puppet] - 10https://gerrit.wikimedia.org/r/579606 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [19:11:26] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) Chris moved the servers to different ports, and for kafka-jumbo1006 it helped, since it is now serving traffic. stat1005 is still suffering of the same issue though. 
[19:11:59] (03PS1) 10Dzahn: installserver: disable squid monitoring if service is absent [puppet] - 10https://gerrit.wikimedia.org/r/579610 (https://phabricator.wikimedia.org/T224576) [19:14:23] (03CR) 10Dzahn: [C: 03+2] installserver: disable squid monitoring if service is absent [puppet] - 10https://gerrit.wikimedia.org/r/579610 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:18:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) a:05Jclark-ctr→03Jgreen [19:19:21] vgutierrez, your new repo is ready [19:19:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) [19:19:51] ❤️ thanks! [19:20:40] the health authorities warn... :-) [19:22:19] 10Operations: persistent cronspam from Cron Daemon - https://phabricator.wikimedia.org/T247608 (10Dzahn) a:03Dzahn Probably happens only since install1002 is currently syncing to more than one server. Soon when install1002/2002 will be removed it will be back to just sync betw... [19:22:37] 10Operations: persistent cronspam from Cron Daemon - https://phabricator.wikimedia.org/T247608 (10Dzahn) p:05Triage→03Medium [19:22:39] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10colewhite) a:03colewhite [19:26:30] (03PS1) 10Cwhite: profile: increase inotify instances in core DCs [puppet] - 10https://gerrit.wikimedia.org/r/579613 (https://phabricator.wikimedia.org/T246860) [19:28:39] cdanis: it doesn't look like watches was enough. on prometheus1003, several reloads didn't clear the errors. 
upping max_user_instances seems to have done the trick though [19:29:09] heh, nod [19:29:34] happy to +1 that puppet change [19:30:03] (03CR) 10CDanis: [C: 03+1] profile: increase inotify instances in core DCs [puppet] - 10https://gerrit.wikimedia.org/r/579613 (https://phabricator.wikimedia.org/T246860) (owner: 10Cwhite) [19:30:09] thx! [19:30:15] thank you! [19:30:37] (03CR) 10Cwhite: [C: 03+2] profile: increase inotify instances in core DCs [puppet] - 10https://gerrit.wikimedia.org/r/579613 (https://phabricator.wikimedia.org/T246860) (owner: 10Cwhite) [19:30:51] !log rebooting labstore1007 for upgrade to buster T224583 [19:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:57] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [19:31:28] the squid icinga checks that alerted earlier are now removed entirely (not just silenced..gone) [19:31:53] based on the service being absented or not [19:32:33] (03CR) 10Brian Wolff: Add wikidata.beta.wmflabs.org to beta csp (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 (owner: 10Brian Wolff) [19:34:14] 10Operations, 10netops, 10cloud-services-team (Kanban): New network request for CloudVPS CODFW instances transport - https://phabricator.wikimedia.org/T247633 (10JHedden) [19:35:26] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [19:38:43] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10colewhite) [[ https://github.com/prometheus/prometheus/issues/3446 | Found a related issue. 
]] We've upped max_user_instances and max_user_watch... [19:41:38] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) [19:55:13] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: some Prometheus not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10Krinkle) [19:57:23] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) [19:57:45] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) @Krinkle please see https://phabricator.wikimedia.org/T246860#5964089 :) [20:00:30] (03PS1) 10Bstorm: Revert "dumps-distribution: move all traffic to labstore1006" [puppet] - 10https://gerrit.wikimedia.org/r/579616 [20:00:58] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:12] (03PS2) 10Bstorm: Revert "dumps-distribution: move all traffic to labstore1006" [puppet] - 10https://gerrit.wikimedia.org/r/579616 (https://phabricator.wikimedia.org/T224583) [20:02:19] !log stat1005 - ip link set en01 down ; ip link set en01 up (T247561) [20:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:24] T247561: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 [20:02:30] (03CR) 10jerkins-bot: [V: 04-1] Revert "dumps-distribution: move all traffic to labstore1006" [puppet] - 10https://gerrit.wikimedia.org/r/579616 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [20:03:46] (03PS3) 10Bstorm: Revert "dumps-distribution: move all traffic to labstore1006" [puppet] - 10https://gerrit.wikimedia.org/r/579616 (https://phabricator.wikimedia.org/T224583) [20:05:47] (03CR) 10Bstorm: [C: 03+2] Revert "dumps-distribution: move all traffic to labstore1006" [puppet] - 10https://gerrit.wikimedia.org/r/579616 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [20:07:08] PROBLEM - Check systemd state on labstore1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:56] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10Krinkle) Thanks... I guess that's bound to cause confusion, but I will try to be one of the people that does know and remembers. 
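Note on the inotify limits raised earlier in this log (fs.inotify.max_user_watches, then max_user_instances): a minimal sketch, not the change that was merged, of how the current kernel limits could be inspected on a Linux host. The function names and the target values are hypothetical; the log does not state which values were used.

```python
from pathlib import Path

# Files the Linux kernel exposes under /proc/sys/fs/inotify.
INOTIFY_KEYS = ("max_user_watches", "max_user_instances", "max_queued_events")


def read_inotify_limits(base="/proc/sys/fs/inotify"):
    """Return the current inotify limits as a dict of ints; missing files are skipped."""
    limits = {}
    for key in INOTIFY_KEYS:
        path = Path(base) / key
        if path.is_file():
            limits[key] = int(path.read_text().strip())
    return limits


def below_minimum(limits, minimums):
    """Return the sorted keys whose current value is lower than the wanted minimum."""
    return sorted(k for k, want in minimums.items() if limits.get(k, 0) < want)


if __name__ == "__main__":
    current = read_inotify_limits()
    # Hypothetical targets, for illustration only.
    wanted = {"max_user_watches": 32768, "max_user_instances": 512}
    print(current, below_minimum(current, wanted))
```

Passing the directory as a parameter keeps the helper usable against a copy of the files, for example when comparing hosts offline.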
[20:11:06] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:00] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [20:14:20] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [20:14:58] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [20:15:25] papaul: ^ looks like you fixed all these indirectly [20:15:34] by fixing networking for kafka-jumbo/stat1005 ? [20:18:44] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:53] mutante: still working [20:25:29] papaul: ok! 
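Note on the alert summaries above: strings such as "(C)10 ge (W)1 ge 0" and "2.405e+08 ge 5e+06" appear to compare a measured value against a warning and a critical threshold. The sketch below assumes that reading (critical threshold, operator, warning threshold, operator, value); it is an interpretation of the log output, not the monitoring plugin's own code, and all names in it are hypothetical.

```python
import operator
import re

# Comparison keywords as they appear in the alert text.
OPS = {"ge": operator.ge, "gt": operator.gt, "le": operator.le, "lt": operator.lt}

# Matches summaries such as "(C)5e+06 ge (W)1e+06 ge 1.905e+05".
SUMMARY_RE = re.compile(
    r"\(C\)(?P<crit>\S+) (?P<op>ge|gt|le|lt) \(W\)(?P<warn>\S+) (?P=op) (?P<value>\S+)"
)


def parse_summary(text):
    """Extract (crit, warn, value, op) from a summary string, or None if absent."""
    m = SUMMARY_RE.search(text)
    if m is None:
        return None
    return float(m["crit"]), float(m["warn"]), float(m["value"]), m["op"]


def classify(value, warn, crit, op="ge"):
    """Map a value to OK/WARNING/CRITICAL, testing the critical threshold first."""
    compare = OPS[op]
    if compare(value, crit):
        return "CRITICAL"
    if compare(value, warn):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    crit, warn, value, op = parse_summary("(C)5e+06 ge (W)1e+06 ge 1.905e+05")
    print(classify(value, warn, crit, op))
```

Under this reading, the earlier acknowledgement for kafka-jumbo1006 (2.405e+08 against a critical threshold of 5e+06) classifies as CRITICAL and the later recovery value (1.905e+05) as OK.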
[20:26:16] (03PS1) 10Bstorm: dumps-distribution: Fix smartmontools so it doesn't trigger alerts [puppet] - 10https://gerrit.wikimedia.org/r/579621 (https://phabricator.wikimedia.org/T224583) [20:27:30] RECOVERY - Check systemd state on labstore1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [20:29:54] 10Operations, 10SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (10JoeWalsh) [20:31:14] 10Operations, 10SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (10dr0ptp4kt) Approved. [20:33:45] (03CR) 10Bstorm: "This is live-hacked into labstore1007 now. It's apparently a bad interaction between systemd and smartd that is eventually fixed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/579621 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [20:48:43] (03PS3) 10Dzahn: add hiera keys for parsoid-php on deployment-parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) [20:48:49] (03CR) 10Jhedden: [C: 03+1] dumps-distribution: Fix smartmontools so it doesn't trigger alerts [puppet] - 10https://gerrit.wikimedia.org/r/579621 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [20:50:42] (03PS3) 10DannyS712: Fix typos (boostrap -> bootstrap) [puppet] - 10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) [20:51:43] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: Fix smartmontools so it doesn't trigger alerts [puppet] - 10https://gerrit.wikimedia.org/r/579621 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [20:54:18] (03CR) 10Alex Monk: "The SSH key was fine (at least when I looked) - just added the role that lets the deployment server actually log into the user account in " [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [20:55:15] (03Abandoned) 10Hashar: beta: pull out deployment-parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [20:55:16] Krenair: that is what I was looking for :) [20:55:31] James_F: seems like krenair fixed up deployment-parsoid11 as a deployment target. 
It missed role::beta::mediawiki \o/ [20:56:34] yeah we were chatting about it in -cloud just now [20:56:48] just running scap now to check all is fine, I already tested SSH myself though [20:56:48] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:42] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1006 is OK: (C)5e+06 ge (W)1e+06 ge 1.905e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [21:00:07] hashar: Krenair: about to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/576493/3/hieradata/cloud/eqiad1/deployment-prep/hosts/deployment-parsoid11 [21:00:14] it's copy/paste from Horizon right now [21:00:49] (03CR) 10Dzahn: [C: 03+2] add hiera keys for parsoid-php on deployment-parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [21:00:54] ack [21:02:31] Krenair: in a perfect world we could avoid adding "role::beta::mediawiki" but i guess too many things are different [21:02:41] in case you know when that was added [21:02:42] (03Abandoned) 10DCausse: [WIP] [logstash] Add a way to move some data to debug_blob [puppet] - 10https://gerrit.wikimedia.org/r/392591 (https://phabricator.wikimedia.org/T180051) (owner: 10DCausse) [21:03:45] mutante, eh that role is literally just to add a labs specific thing that's necessary due to the differences between prod's admin module vs. labs' LDAP + security::access::conf stuff [21:04:20] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:01] (03PS1) 10Bstorm: dumps-distribution: move all NFS traffic to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/579626 (https://phabricator.wikimedia.org/T224583) [21:05:03] (03PS1) 10Bstorm: dumps-distribution: switch which host does acme [puppet] - 10https://gerrit.wikimedia.org/r/579627 (https://phabricator.wikimedia.org/T224583) [21:06:27] Krenair: ACK, i see. that's not much code at all. so close ... [21:06:56] Clearly we need to reimplement the admin module for beta cluster magically as an LDAP read. ;-) [21:07:04] much better than before but maybe we can fold that into an "if $realm" and then it's perfect [21:07:47] (03PS1) 10Bstorm: dumps-distribution: set the TTL to 5M for dumps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/579628 (https://phabricator.wikimedia.org/T224583) [21:07:58] no, but we could include that stuff when we know it's in cloud and avoid multiple roles [21:08:14] I dunno [21:08:17] probably should be in scap? 
[21:08:27] I think we prefer an extra little role+profile compared to a realm branch [21:08:34] it's just about 4 lines [21:08:47] at least that's the impression I got when setting up acme-chief in cloud vps [21:08:58] i dunno, the "extra little role" lead us to the situation on 10 [21:09:09] it's so close now thouhg [21:09:13] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: set the TTL to 5M for dumps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/579628 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [21:09:22] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:25] removed stuff from Horizon hiera [21:09:45] running puppet on parsoid11 [21:10:28] -request_terminate_timeout = 201 [21:10:28] +request_terminate_timeout = 240 [21:10:28] [21:10:45] -opcache.memory_consumption = 1024 [21:10:45] +opcache.memory_consumption = 300 [21:10:51] sigh.. should have stayed the same [21:10:59] why.. 
it was copy/paste [21:11:48] (03PS1) 10Andrew Bogott: Horizon: update nova_policy to conform with the current API policy.json [puppet] - 10https://gerrit.wikimedia.org/r/579632 (https://phabricator.wikimedia.org/T247573) [21:11:59] nothing is broken though [21:12:02] PROBLEM - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [21:12:43] (03Abandoned) 10Andrew Bogott: Horizon policy.json for nova: disable resizing [puppet] - 10https://gerrit.wikimedia.org/r/579606 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [21:12:46] (03Abandoned) 10Andrew Bogott: Horizon: update nova_policy to conform with the current API policy.json [puppet] - 10https://gerrit.wikimedia.org/r/579632 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [21:13:37] (03PS1) 10Andrew Bogott: nova policy.json: remove all obsolete v2 policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579634 (https://phabricator.wikimedia.org/T247573) [21:13:39] (03PS1) 10Andrew Bogott: nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - 10https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573) [21:13:41] (03PS1) 10Andrew Bogott: nova policy.json: sort all policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573) [21:13:43] (03PS1) 10Andrew Bogott: nova policy.json: Remove all redundant policies [puppet] - 10https://gerrit.wikimedia.org/r/579637 (https://phabricator.wikimedia.org/T247573) [21:13:45] (03PS1) 10Andrew Bogott: nova policy.json: only permit admin user to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573) [21:13:47] (03PS1) 10Andrew Bogott: 
nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - 10https://gerrit.wikimedia.org/r/579639 [21:13:49] (03PS1) 10Andrew Bogott: Horizon: update nova_policy to conform with the current API policy.json [puppet] - 10https://gerrit.wikimedia.org/r/579640 (https://phabricator.wikimedia.org/T247573) [21:14:52] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [21:15:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10Jclark-ctr) [21:16:56] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:15] 10Operations, 10LDAP-Access-Requests: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Dzahn) 05Stalled→03Open a:05Moushira→03Dzahn Thanks @Moushira! Updating it in the repository. [21:17:22] (03PS1) 10Bstorm: dumps-distribution: fail over to labstore1007 for dumps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/579642 (https://phabricator.wikimedia.org/T224583) [21:18:46] 10Operations, 10LDAP-Access-Requests: Allow LDAP access to superset dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Moushira) Thanks @Dzahn. 
[21:19:24] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: move all NFS traffic to labstore1007 [puppet] - 10https://gerrit.wikimedia.org/r/579626 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [21:19:59] (03PS1) 10Krinkle: wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) [21:21:12] (03CR) 10Krinkle: "There are some calls to $wgConf->loadFullData() in code search, which is fine. The method exists and does nothing by default. As part of s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [21:22:24] hrm, anyone know anything about this spike of OrderedStreamingForkController notices? [21:22:31] (03PS1) 10Dzahn: admins: add expiry date and contact to Moushirael [puppet] - 10https://gerrit.wikimedia.org/r/579644 (https://phabricator.wikimedia.org/T242000) [21:22:40] brennen: hmm, thats probably me? It's used by a maint script i'm running [21:23:14] ebernhardson: ah - i see about 3k of these in logstash around 21:06 [21:23:45] brennen: ahh, hmm. Looks like if we kill the parent process all the children log errors for awhile before dieing perhaps :S [21:24:11] all those should be it trying to write to the pipe back to the parent process [21:25:44] ebernhardson: makes sense - just wanted to make sure i wasn't seeing some new breakage in wmf.23 just surfacing [21:25:51] thanks! 
[21:31:51] (03CR) 10Dzahn: [C: 03+2] admins: add expiry date and contact to Moushirael [puppet] - 10https://gerrit.wikimedia.org/r/579644 (https://phabricator.wikimedia.org/T242000) (owner: 10Dzahn) [21:31:59] (03PS2) 10Dzahn: admins: add expiry date and contact to Moushirael [puppet] - 10https://gerrit.wikimedia.org/r/579644 (https://phabricator.wikimedia.org/T242000) [21:33:29] (03PS1) 10Krinkle: wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) [21:34:24] (03PS2) 10Andrew Bogott: nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - 10https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573) [21:34:26] (03PS2) 10Andrew Bogott: nova policy.json: sort all policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573) [21:34:29] (03PS2) 10Andrew Bogott: nova policy.json: only permit admin user to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573) [21:34:30] (03PS2) 10Andrew Bogott: nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - 10https://gerrit.wikimedia.org/r/579639 [21:34:32] (03PS2) 10Andrew Bogott: Horizon: update nova_policy to conform with the current API policy.json [puppet] - 10https://gerrit.wikimedia.org/r/579640 (https://phabricator.wikimedia.org/T247573) [21:34:34] (03PS1) 10Andrew Bogott: nova policy: convert from json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/579647 [21:35:05] (03PS1) 10Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579648 [21:35:07] (03PS1) 10Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579649 [21:35:13] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Allow LDAP access to superset 
dashboards for Moushira Elamrawy - https://phabricator.wikimedia.org/T242000 (10Dzahn) 05Open→03Resolved [21:38:55] 10Operations, 10SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (10Dzahn) [21:40:00] 10Operations, 10SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (10Dzahn) a:03ArielGlenn [21:40:08] 10Operations, 10SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (10Dzahn) p:05Triage→03Medium [21:40:50] 10Operations: apt config on planet1001 would install systemd from backports - https://phabricator.wikimedia.org/T247592 (10Dzahn) a:03Dzahn [21:43:17] mutante, hey [21:43:22] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Dzahn) Fixed by @Papaul for kafka-jumbo1006. We saw recoveries for kafka lag on other machines all at once. 
[21:43:26] (03PS2) 10Andrew Bogott: nova policy.json: remove all obsolete v2 policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579634 (https://phabricator.wikimedia.org/T247573) [21:43:27] mutante, things are broken on deployment-parsoid11 [21:43:28] (03PS3) 10Andrew Bogott: nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - 10https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573) [21:43:30] (03PS3) 10Andrew Bogott: nova policy.json: sort all policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573) [21:43:32] (03PS2) 10Andrew Bogott: nova policy.json: Remove all redundant policies [puppet] - 10https://gerrit.wikimedia.org/r/579637 (https://phabricator.wikimedia.org/T247573) [21:43:34] (03PS3) 10Andrew Bogott: nova policy.json: only permit admin user to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573) [21:43:36] (03PS3) 10Andrew Bogott: nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - 10https://gerrit.wikimedia.org/r/579639 [21:43:38] (03PS3) 10Andrew Bogott: Horizon: update nova_policy to conform with the current API policy.json [puppet] - 10https://gerrit.wikimedia.org/r/579640 (https://phabricator.wikimedia.org/T247573) [21:43:40] (03PS2) 10Andrew Bogott: nova policy: convert from json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/579647 [21:43:42] mutante, it turns out your the file was missing the .yaml extension [21:43:43] Krenair: they weren't when i ran puppet [21:43:52] the new file* [21:43:52] oh [21:43:58] so it doesn't get loaded [21:44:15] will upload a fix shortly [21:44:33] i see, thanks. 
i can fix it too [21:45:37] (03PS1) 10Alex Monk: Follow-up Ice6a9431: Fix filename of deployment-parsoid11 hieradata [puppet] - 10https://gerrit.wikimedia.org/r/579650 (https://phabricator.wikimedia.org/T246854) [21:45:57] mutante, ^ [21:46:33] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Follow-up Ice6a9431: Fix filename of deployment-parsoid11 hieradata [puppet] - 10https://gerrit.wikimedia.org/r/579650 (https://phabricator.wikimedia.org/T246854) (owner: 10Alex Monk) [21:46:52] Krenair: merged on master. tx [21:47:10] PROBLEM - Check systemd state on logstash2029 is CRITICAL: connect to address 10.192.48.140 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:12] PROBLEM - dhclient process on logstash2029 is CRITICAL: connect to address 10.192.48.140 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:47:32] PROBLEM - Check size of conntrack table on logstash2029 is CRITICAL: connect to address 10.192.48.140 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:48:36] RECOVERY - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 53.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [21:49:02] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:49:24] logstash2029 is me, it is reimaging but the downtime didn't seem to take [21:51:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:59:34] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:46] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Papaul) We have wrong mgmt password on all 3 nodes [22:01:50] (03PS2) 10Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579648 [22:01:52] (03PS2) 10Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579649 [22:01:54] (03PS1) 10Krinkle: Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821) [22:01:56] (03PS1) 10Krinkle: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) [22:01:58] (03PS1) 10Krinkle: [WIP] Remove use of the $globals tmp cache file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) [22:02:48] (03CR) 10Jforrester: [C: 03+1] wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:02:50] PROBLEM - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [22:03:10] cscott: ^ testing, right? [22:03:19] (03CR) 10Jforrester: [C: 03+1] "Definitely a PHP4-ism." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:04:23] (03CR) 10Jforrester: "Great." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:05:37] (03CR) 10Jforrester: "Interesting." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:05:39] (03CR) 10Krinkle: "This is only WIP because I'm scared shitless I missed something and want myself and James to ponder over it a few days at least." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:06:39] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:07:05] @seen arlolra [22:07:05] mutante: Last time I saw arlolra they were talking in the channel, they are still in the channel #mediawiki-parsoid at 3/13/2020 4:03:07 PM (6h3m57s ago) [22:07:16] wm-bot, very helpful :) [22:07:35] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Papaul) I have @Jclark-ctr repalce the cable to stat1005 same issue. I have him also disconnect the cable while i was looking at the switch the interface went from up up to up down a... [22:09:02] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: fail over to labstore1007 for dumps.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/579642 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [22:09:32] James_F: even before my recent optimisations, the cache miss scenario was fast enough that the +10ms hit wasn't "too bad" for a few rare prod requests. 
the compilation step from the 100s of YAML files though, will almost certainly be too slow to run on demand, and more generally I'd rather embrace build steps more to keep prod (c)leaner. The scap step as you mentioned on the mtime task makes a lot of sense to me. It also naturally [22:09:32] avoids the mtime issue because running it on the CLI from deploy1001 means there is no opcache to be concerned about (plus, YAML doesn't have opcache). [22:09:47] (03PS2) 10Bstorm: dumps-distribution: switch which host does acme [puppet] - 10https://gerrit.wikimedia.org/r/579627 (https://phabricator.wikimedia.org/T224583) [22:10:04] * James_F nods. [22:10:10] s/scap step/ build step before calling scap/ - either is fine by me [22:10:36] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:40] but yeah, this'll be the focus starting from April between us/you/releng/sre [22:10:46] Krinkle: Mostly I'm thinking that we should fix the non-local uses of wgConf so that we can make it single-wiki, rather than put effort into focussing on wgConf. [22:10:55] * James_F nods. [22:12:02] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:12:32] Right yeah, if we only had to load the current wiki in most cases, that would save a little bit of memory and procesing cost. I don't oppose it. But it does mean that we'd have to extend or break the SiteConfig interface in core to accomodate some other way of loading the config on-demand for those use cases. 
We can certainly do that, and also check with other wiki farmers to see what they'd like from that and/or have already done [22:12:32] to solve it on their end. [22:12:44] ACKNOWLEDGEMENT - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused daniel_zahn tests https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [22:12:53] For now though I'm hopeful we might be able to side-step it in a way that makes it fully orthogonal with the static config goal. [22:13:24] e.g. not somethign we'd have to change if/when we do do that change. [22:13:42] * James_F nods. [22:16:51] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: switch which host does acme [puppet] - 10https://gerrit.wikimedia.org/r/579627 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [22:21:14] !log downtimed labstore1006 for upgrades T224583 [22:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:21] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [22:26:40] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) Noting here as well: when doing these in-place upgrades, facter 3 causes a funny thing to happen wher... 
[22:27:34] !log rebooting labstore1006 T224583 [22:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:39] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [22:29:42] RECOVERY - parsoid on scandium is OK: HTTP OK: HTTP/1.1 200 OK - 1535 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [22:29:58] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:36] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:13] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [22:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:10] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:35] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:49] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:35] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [23:01:09] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:20] (03CR) 10Dzahn: [C: 03+2] phabricator: add warning about /srv/dumps [puppet] - 10https://gerrit.wikimedia.org/r/565403 (owner: 10Dzahn) [23:04:49] (03Abandoned) 10Dzahn: gerrit: refactor, move java setup to separate class [puppet] - 10https://gerrit.wikimedia.org/r/548554 (owner: 10Dzahn) [23:06:43] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:12:36] (03PS2) 10Dzahn: wikimania_scholarships: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/572350 [23:12:44] !log rebooting labstore1006 for upgrade to stretch T224583 [23:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:50] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [23:16:40] (03CR) 10Dzahn: [C: 03+2] "this is harmless because the server has port 80 open from 3 separate rules so nothing will happen until we get to the last one" [puppet] - 10https://gerrit.wikimedia.org/r/572350 (owner: 10Dzahn) [23:23:38] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:30] RECOVERY - 
ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [23:27:32] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:08] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [23:32:12] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:36] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:04] 10Operations: decom racktables? - https://phabricator.wikimedia.org/T247646 (10Dzahn) [23:44:15] 10Operations: decom racktables? - https://phabricator.wikimedia.org/T247646 (10Dzahn) [23:45:08] 10Operations, 10Wikimedia-Logstash, 10observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10Dzahn) [23:45:46] 10Operations, 10Wikimedia-Logstash, 10observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10Dzahn) Added parent task which may mean we don't keep scholarships. 
[23:47:50] 10Operations, 10Wikimedia-Mailing-lists: Request new mailing list for Myanmar Wikimedia Community User Group - https://phabricator.wikimedia.org/T247647 (10Ninjastrikers) [23:49:43] 10Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [23:49:44] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [23:49:55] 10Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) a:03Dzahn [23:50:13] 10Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [23:51:30] 10Operations: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [23:54:44] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [23:55:55] 10Operations, 10serviceops: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (10Dzahn) [23:56:02] 10Operations, 10serviceops: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (10Dzahn) a:03Dzahn [23:56:24] 10Operations, 10serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [23:59:49] 10Operations, 10serviceops: replace bromine and vega with buster VMs - https://phabricator.wikimedia.org/T247650 (10Dzahn) [23:59:55] 10Operations, 10serviceops: 
replace bromine and vega with buster VMs - https://phabricator.wikimedia.org/T247650 (10Dzahn) a:03Dzahn