[00:01:03] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2372.codfw.wmnet ` The log can be found in `/var/log...
[00:03:34] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2373.codfw.wmnet ` The log can be found in `/var/log...
[00:15:42] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:30] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:29] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:50] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2372.codfw.wmnet'] ` and were **ALL** successful.
[00:24:40] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2374.codfw.wmnet ` The log can be found in `/var/log...
[00:25:39] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:26:17] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2373.codfw.wmnet'] ` and were **ALL** successful.
[00:28:22] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2375.codfw.wmnet ` The log can be found in `/var/log...
[00:34:47] (CR) Jforrester: [C: +1] "Should we try to align exactly with the "real" version of this, in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/m" [mediawiki-config] - https://gerrit.wikimedia.org/r/566411 (owner: Legoktm)
[00:39:34] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:55] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:15] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:36] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:00] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] ` and were **ALL** successful.
[00:47:27] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2376.codfw.wmnet ` The log can be found in `/var/log...
[00:49:32] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:50:21] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2375.codfw.wmnet'] ` and were **ALL** successful.
[00:50:26] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:50:30] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:50:36] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:50:40] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[00:51:24] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:53:04] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:59:14] Operations, ops-codfw, DC-Ops, decommission, cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (Papaul)
[01:01:23] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:46] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:40] Why do we still have the `files` directory at the top of puppet.git?
[01:07:31] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2376.codfw.wmnet'] ` and were **ALL** successful.
[01:11:58] Operations, ops-codfw, serviceops, Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) |Servers Rack D3| Ready for service| |mw2366|Yes| |mw2367|Yes| |mw2368|Yes| |mw2369|Yes| |mw2370|Yes| |mw2371|Yes| |mw2372|Yes| |mw2373|Yes| |mw2374|Ye...
[01:13:37] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:14:54] (PS1) Alex Monk: More fixes for codfw1dev puppet [puppet] - https://gerrit.wikimedia.org/r/574136 (https://phabricator.wikimedia.org/T242607)
[01:16:29] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:17:34] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[01:17:40] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[01:17:48] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:18:02] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:18:11] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:19:40] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:21:26] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:22:44] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:24:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:29:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:32:30] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:39:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:45:24] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:46:24] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:48:25] Krenair: because nobody wanted to touch all SSL certs at once .. i guess
[01:48:32] mostly because of files/ssl
[01:48:37] :/
[01:49:04] does it specifically cause issues in cloud?
[01:50:27] (PS2) Dzahn: site: add mw2366-mw2376 with spare role [puppet] - https://gerrit.wikimedia.org/r/574124 (https://phabricator.wikimedia.org/T241852)
[01:53:58] !log starting new ganeti VMs apt1001 and apt2001 for OS install (WIP, not prod)
[01:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:56] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 37 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:13:01] !log ganeti - removing instances apt1001/apt2001 again, starting over
[02:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:11] mutante, nah I just thought we would've gotten rid of that by now
[02:15:51] Krenair: ok.. and .. yea...
[02:16:28] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[02:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:26] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[02:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:18] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 37 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:19:10] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:21:43] ^ per linked docs this isn't a case where it's widespread failure. it's just exactly one or 2 probes above alert threshold
[02:24:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:24:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 34 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:25:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:40:13] (PS1) Dzahn: installserver: temp remove new APT servers to avoid cron spam [puppet] - https://gerrit.wikimedia.org/r/574139
[02:41:25] (CR) Dzahn: [C: +2] installserver: temp remove new APT servers to avoid cron spam [puppet] - https://gerrit.wikimedia.org/r/574139 (owner: Dzahn)
[02:46:01] ok.. they are not ready yet but disabled the cron spam about it.
[02:46:04] and off again
[03:06:08] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:12:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:37:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[03:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:41:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[03:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:39] Operations, Commons, MediaWiki-File-management, Traffic, User-DannyS712: 500 error when viewing a file - https://phabricator.wikimedia.org/T245904 (DannyS712)
[04:02:21] Operations, Commons, MediaWiki-File-management, Traffic, User-DannyS712: 500 error when viewing a file - https://phabricator.wikimedia.org/T245904 (DannyS712) Tried again, got 503: ` Request from [snip] via cp4025 frontend, Varnish XID 325012919 Error: 503, Backend fetch failed at Sat, 22 Feb...
[04:11:46] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:42:26] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:04:24] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:10:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:11:10] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:17:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:06:44] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:06:44] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:18:00] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:24:12] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 528 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:38:24] (PS1) CDanis: ripe atlas alerts: allow more ipv6 failures [puppet] - https://gerrit.wikimedia.org/r/574145
[09:39:03] (PS2) Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/573969
[09:40:18] (CR) jerkins-bot: [V: -1] Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek)
[11:11:15] (PS1) MarcoAurelio: Fix parent permissions; inherit from `operations/software` [software/nss-dnsdc] (refs/meta/config) - https://gerrit.wikimedia.org/r/574147
[11:12:32] PROBLEM - MegaRAID on analytics1044 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:12:35] ACKNOWLEDGEMENT - MegaRAID on analytics1044 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T245910 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:12:48] Operations, ops-eqiad: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (ops-monitoring-bot)
[11:21:14] Operations, ops-eqiad, Analytics: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (Volans)
[11:39:31] Operations, Data-Services, cloud-services-team: Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (MarcoAurelio)
[11:41:05] (PS1) MarcoAurelio: WIP [dns] - https://gerrit.wikimedia.org/r/574150
[11:43:09] (PS2) MarcoAurelio: wikimedia: Add new records for gr.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/574150 (https://phabricator.wikimedia.org/T245911)
[11:48:38] Operations, DBA, Data-Services, cloud-services-team: Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (MarcoAurelio)
[11:49:37] (PS2) Volans: sre.ganeti.makevm: refactor for new spicerack [cookbooks] - https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068)
[11:51:35] (PS1) MarcoAurelio: prod_sites: Apache configuration for gr.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/574151
[11:51:56] (PS2) Volans: ganeti: add logging for GntInstance actions [software/spicerack] - https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068)
[11:51:58] (PS3) Volans: ganeti: add VM creation capability [software/spicerack] - https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068)
[11:52:00] (PS1) Volans: spicerack: add support for HTTP proxy [software/spicerack] - https://gerrit.wikimedia.org/r/574152
[11:52:19] (PS2) MarcoAurelio: prod_sites: Apache configuration for gr.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/574151 (https://phabricator.wikimedia.org/T245911)
[11:52:35] (CR) Volans: "done" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) (owner: Volans)
[11:52:45] (CR) Volans: "reply inline" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) (owner: Volans)
[11:58:04] (PS2) CDanis: spicerack: add support for HTTP proxy [software/spicerack] - https://gerrit.wikimedia.org/r/574152 (owner: Volans)
[11:59:55] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:11:25] (PS3) Volans: spicerack: add support for HTTP proxy [software/spicerack] - https://gerrit.wikimedia.org/r/574152
[12:16:48] (PS1) Volans: spicerack: add http_proxy to config file [puppet] - https://gerrit.wikimedia.org/r/574155
[12:17:13] (CR) Volans: "Needed for I9337011fae6b2bd8f8986ed4d0949334a69057de" [puppet] - https://gerrit.wikimedia.org/r/574155 (owner: Volans)
[12:32:44] (PS1) CDanis: cdanis dotfiles: Singapore time [puppet] - https://gerrit.wikimedia.org/r/574156
[12:36:34] (CR) CDanis: [C: +2] cdanis dotfiles: Singapore time [puppet] - https://gerrit.wikimedia.org/r/574156 (owner: CDanis)
[14:11:23] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:17:52] just restarted the node manager --^
[14:18:00] weird errors, saved logs, will check on monday
[14:18:02] (PS1) Elukey: Add hiera overrides for analytics1044 after disk failure [puppet] - https://gerrit.wikimedia.org/r/574169 (https://phabricator.wikimedia.org/T245910)
[14:19:17] (CR) Elukey: [C: +2] Add hiera overrides for analytics1044 after disk failure [puppet] - https://gerrit.wikimedia.org/r/574169 (https://phabricator.wikimedia.org/T245910) (owner: Elukey)
[14:21:04] Operations, ops-eqiad, Analytics, Patch-For-Review: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (elukey) The host is going to be refreshed this fiscal, I'd just avoid using the disk for the moment.
[14:27:25] (CR) Andrew Bogott: [C: +2] More fixes for codfw1dev puppet [puppet] - https://gerrit.wikimedia.org/r/574136 (https://phabricator.wikimedia.org/T242607) (owner: Alex Monk)
[14:42:30] (CR) Urbanecm: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/574150 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[14:42:59] (CR) Urbanecm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/574151 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[14:46:39] (CR) Dvorapa: "See also https://bitbucket.org/stoneleaf/enum34/issues/27/enum34-118-broken" [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek)
[14:53:47] (PS1) Andrew Bogott: designatemakedomain: make region aware [puppet] - https://gerrit.wikimedia.org/r/574172
[15:28:00] (PS1) Andrew Bogott: wmfkeystonehooks: make region-aware [puppet] - https://gerrit.wikimedia.org/r/574176
[16:18:47] (CR) Volans: [C: +1] "LGTM, let's try it out next week." [puppet] - https://gerrit.wikimedia.org/r/573976 (https://phabricator.wikimedia.org/T245511) (owner: Filippo Giunchedi)
[16:35:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:37:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:46:01] James_F: you there?
[16:49:15] Urbanecm: maybe you?
[16:49:44] in a meeting
[16:49:46] what's the matter?
[16:50:01] Urbanecm: dblists
[16:50:06] but not urgent
[16:50:16] please don't let me interrupt you
[16:50:18] if you'll be here in an hour, the meeting should be over
[16:52:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:54:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:56:54] (PS1) MarcoAurelio: [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911)
[16:57:56] (CR) jerkins-bot: [V: -1] [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[16:59:23] Amir1: works like a charm now, thanks
[17:02:15] (PS2) MarcoAurelio: [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911)
[17:02:23] Daimona: thank you for making a ticket otherwise I would have missed it
[17:02:50] Yeah I waited a bit in case it was transient
[17:02:53] But it wasn't
[17:04:51] Is zuul working for anyone?
[17:05:11] Looks like there's a big chunk of patches to be tested but no one is being tested atm?
[17:05:42] just picked the -prio queue
[17:06:01] It's certainly stuck
[17:06:18] And probably it can't consume the queue
[17:06:19] (CR) jerkins-bot: [V: -1] [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[17:07:19] oofff
[17:07:32] tox failing for a non-python change, nice
[17:08:31] (PS3) MarcoAurelio: [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911)
[17:09:50] (CR) jerkins-bot: [V: -1] [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[17:12:39] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:29:37] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:35:51] (PS4) MarcoAurelio: [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911)
[17:37:08] (CR) jerkins-bot: [V: -1] [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[17:38:48] ...
[17:38:49] (PS5) MarcoAurelio: [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911)
[17:40:16] hauskatze: what was your question?
[17:40:25] (assuming the jenkins failure :-)
[17:40:52] (CR) jerkins-bot: [V: -1] [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[17:41:10] Urbanecm: this **** of a patch keeps failing: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/574184/
[17:41:19] I ran the relevant composer commands
[17:42:12] I had to fake a wikiversions.json entry
[17:42:21] but didn't commit that one
[17:42:37] composer test fails for me due to some weird 'lacks .tiff file'
[17:43:03] TimelineTest::testTimelineFontFileEexists
[17:44:20] looking
[17:44:42] (PS6) MarcoAurelio: [WIP] Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911)
[17:45:03] I'm re-adding the wikiversions entry
[17:45:11] see if that stops CI from complaining
[17:45:58] probably
[17:46:10] CI seems to be happy!
[17:46:20] pff
[17:46:23] silly failure
[17:46:29] indeed
[17:46:33] wikiversions.json will cause a merge conflict
[17:46:37] yes
[17:46:53] but that's...easy to solve I guess
[17:47:03] I'll review your patch later today
[17:47:05] thanks for it!
[17:50:18] np
[17:50:56] Amir1: if you're still around, could you please take a look at https://phabricator.wikimedia.org/p/Agnes12353/? Seems spam, I'd suggest deleting picture and blurb
[17:51:33] I couldn't find any phab admin earlier today
[17:52:42] Daimona: disabled now
[17:52:54] Thanks
[17:52:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:52:58] I'm not sure if I can delete the description though
[17:53:06] Uhm
[17:53:23] Daimona: for disabling, you can use https://tools.wmflabs.org/phab-ban/ - you should have the necessary permissions
[17:53:28] I seem to recall Andre doing that
[17:53:32] (CR) MarcoAurelio: "So it looks like this new way to build the config requires a wikiversions.json entry no matter what. However this will for sure cause a me" [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: MarcoAurelio)
[17:53:34] Andre has shell access
[17:53:36] I can't
[17:53:46] Urbanecm: Yes, I do, but for this one I wanted to ask for complete deletion
[17:53:47] yup, Andre has shell access
[17:53:53] Ooooh yes, maybe it must be done via shell
[17:54:08] (Which is meh)
[17:54:33] yes, you need shell to alter user profiles
[17:54:40] :(
[17:54:48] Daimona: file a task against #phabricator i guess
[17:54:59] I know it as I had to request something like that in the past
[17:55:09] I will
[17:55:15] Public task is fine, I guess?
[17:55:23] I'd go public
[17:56:35] {{Done}} at https://phabricator.wikimedia.org/T245926
[17:56:44] (y)
[17:57:07] Spam on phabricator, what else now?
[17:57:15] maybe the user can be outright deleted if they've done nothing else
[17:57:31] Daimona: coming soon, spam on gerrit
[17:57:42] Yeah maybe
[17:57:54] Hah sounds fun
[17:58:09] I've even seen a pseudo-ransomware attempt on wiki pages!
[17:58:39] https://it.wikipedia.org/wiki/Speciale:Contributi/NiCoLeDeVrIeS [17:58:54] It reads "Your account will be blocked unless you install ..." [17:58:57] sounds like fun [17:59:13] Innaccetabile. [17:59:29] *Inaccettabile [17:59:31] grr [17:59:48] Too many doubles :D [18:00:19] heh [18:00:57] (PS7) MarcoAurelio: Initial configuration for gr.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) [18:01:30] no more WIP [18:01:53] Whip was a brand of washing machine soap now that I remember [18:02:33] wipp* [18:13:36] I wonder if an admin could edit the description (and blank it) [18:19:27] apergos: I checked at my own phab instance, and couldn't find that [18:23:35] I think you would have to do that through the db [18:24:13] apergos: that's what it's done, but requires CLI [18:24:20] *what we've done in the past [18:24:42] gotcha [18:25:03] so maybe if you're in the 'ops' group you could theoretically do it :-) [18:25:25] otherwise, better call andre [18:26:18] hauskatze: I _think_ we normally delete the accounts [18:26:45] Urbanecm: No idea.
Deleting accounts is possible but problematic [18:26:49] andre doesn't have root there, and he doesn't do read-write db access [18:27:14] andre is in a group that allows him to run some commands Urbanecm [18:28:13] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml$330 [18:29:56] deleting the account may be easier if they've done nothing else though [18:30:32] The account can be deleted and then blocked on wikitech/mediawiki [18:30:43] which should prevent them from re-creating it on phab [18:31:42] already done on mw [18:31:47] no LDAP one [18:37:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:45:08] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek) [19:32:59] PROBLEM - Host mw1372 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:54] (PS6) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [22:10:45] (PS1) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [22:27:26] (PS7) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [22:27:46] (PS2) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [22:30:20] (CR) jerkins-bot: [V: -1] (WIP) mcrouter: add gutter pool servers in configuration [puppet] -
https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli) [22:34:22] (PS8) Effie Mouzeli: (WIP) mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) [22:35:55] (PS3) Effie Mouzeli: mcrouter: enable gutter pool config on mwdebug1001 and mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/574200 (https://phabricator.wikimedia.org/T213089) [22:47:09] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (Krenair) Open→Resolved done [23:23:01] (CR) Krinkle: [V: +2 C: +2] Fix parent permissions; inherit from `operations/software` [software/nss-dnsdc] (refs/meta/config) - https://gerrit.wikimedia.org/r/574147 (owner: MarcoAurelio) [23:23:25] (CR) Krinkle: [C: +1] "I am not wmf/ops, as such I cannot actually merge this." [software/nss-dnsdc] (refs/meta/config) - https://gerrit.wikimedia.org/r/574147 (owner: MarcoAurelio) [23:24:23] (CR) Paladox: "Needs an admin to re-parent. (Project owners cannot re-parent)" [software/nss-dnsdc] (refs/meta/config) - https://gerrit.wikimedia.org/r/574147 (owner: MarcoAurelio) [23:25:49] (PS1) Reedy: Remove outdated flaggedrevs.php comment [mediawiki-config] - https://gerrit.wikimedia.org/r/574210 [23:49:10] (CR) Krinkle: [C: +1] ":D" [mediawiki-config] - https://gerrit.wikimedia.org/r/574210 (owner: Reedy)