[02:13:15] Operations, ops-eqiad, Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (Peachey88)
[03:18:05] PROBLEM - snapshot of s2 in codfw on db1115 is CRITICAL: snapshot for s2 at codfw taken more than 4 days ago: Most recent backup 2019-10-23 02:51:10 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[04:56:35] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:55:07] Operations, ops-ulsfo, Traffic, Wikidata, Wikidata-Query-Service: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (Bugreporter)
[10:01:45] PROBLEM - Host asw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:17] PROBLEM - Host bast3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:17] PROBLEM - Host cp3061.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:29] PROBLEM - Host cp3062.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:29] PROBLEM - Host cp3065.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:29] PROBLEM - Host cp3064.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:29] PROBLEM - Host cp3063.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:55] PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100%
[10:03:41] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 41, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:04:03] PROBLEM - Host lvs3007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:21] PROBLEM - Host ganeti3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:56] Errr
[10:06:02] Not good I guess
[10:08:35] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 43, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:08:37] RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 88.67 ms
[10:08:49] RECOVERY - Host asw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms
[10:09:15] RECOVERY - Host cp3062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.08 ms
[10:11:01] RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms
[10:11:11] RECOVERY - Host lvs3007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.05 ms
[10:13:49] RECOVERY - Host bast3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms
[10:13:49] RECOVERY - Host cp3061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.95 ms
[10:14:01] RECOVERY - Host cp3065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.99 ms
[10:14:01] RECOVERY - Host cp3064.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.97 ms
[10:14:01] RECOVERY - Host cp3063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.03 ms
[10:36:18] Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active
[11:26:24] Operations, ops-ulsfo, Traffic, Wikidata, Wikidata-Query-Service: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (jijiki) Looks like it started on Friday ~12:00-13:00 UTC https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?panelId=1&fullscreen&orgId=...
[11:26:29] Operations, ops-ulsfo, Traffic, Wikidata, Wikidata-Query-Service: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (jijiki)
[12:06:02] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (dr0ptp4kt) @Yurik good point. I think the guidance for use of the functionality will necessarily need to n...
[12:24:26] Operations, ops-ulsfo, Traffic, Wikidata, Wikidata-Query-Service: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (ema) In esams, the WDQS query above fails with a java exception after 60s: ` $ curl -v -w '%{time_connect}:%{time_starttransfer}:%{time_total}'...
[12:26:54] Operations, Wikimedia-Etherpad: Disable old Etherpad installation after migrating content to Etherpad Lite installion - https://phabricator.wikimedia.org/T47312 (Peachey88)
[12:28:14] Operations, Wikimedia-Etherpad: Upgrade EtherPad to 1.1.17 - https://phabricator.wikimedia.org/T31822 (Peachey88)
[12:29:55] Operations, Wikimedia-Etherpad: Enable tags on Etherpad - https://phabricator.wikimedia.org/T32240 (Peachey88)
[12:57:29] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[18:35:43] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:40:27] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:53:27] Operations, ops-eqiad, Discovery-Search: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (Mathew.onipe) This is causing mjolnir deploy directory to become unavailable/missing and also causing puppet to fail.
[18:53:31] Operations, DC-Ops, Traffic, decommission: decommission multatuli - https://phabricator.wikimedia.org/T236489 (Papaul)
[18:55:46] Operations, DC-Ops, Traffic, decommission: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (Papaul)
[18:58:04] Operations, DC-Ops, Traffic, decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (Papaul)
[18:59:02] Operations, DC-Ops, decommission: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (Papaul)
[19:01:39] Operations, ops-esams, Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (Papaul)
[19:03:41] Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (Papaul)
[19:05:38] Operations, DC-Ops, Traffic, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul)
[19:06:51] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:07:39] Operations, ops-esams, decommission: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518 (Papaul)
[19:11:36] Operations, ops-esams, DC-Ops, decommission: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742 (Papaul) Open→Resolved Disk wipe complete on all those servers this can be resolved.
[19:11:38] Operations, ops-esams, Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (Papaul)
[19:33:19] (PS13) Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - https://gerrit.wikimedia.org/r/520295
[19:33:34] (PS6) Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509)
[19:50:07] (Abandoned) DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/544982 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[19:50:45] (PS1) DannyS712: Partial cleanup of InitializeSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178)
[20:00:52] (PS2) DannyS712: Partial cleanup of InitializeSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178)
[20:01:49] (CR) jerkins-bot: [V: -1] Partial cleanup of InitializeSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[20:23:58] (CR) Jon Harald Søby: [C: +1] mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio)
[20:35:01] (CR) DannyS712: "Recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[20:35:37] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:26:39] (PS1) BryanDavis: cloud vps: Add support for Buster to LXC module [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455)
[21:33:28] (CR) BryanDavis: [C: -1] "Needs testing. I have done these changes manually and seen them work, but I have not applied them via Puppet yet. I will find a project wi" [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455) (owner: BryanDavis)
[21:34:35] (CR) BryanDavis: [C: -1] "tgr: Take a look at Ie65a888c484e3db15a50c95dbd30c143c56c7bcc" [puppet] - https://gerrit.wikimedia.org/r/546221 (https://phabricator.wikimedia.org/T236455) (owner: Gergő Tisza)
[22:29:41] Puppet, Cloud-VPS: geoipupdate missing on buster on Cloud VPS - https://phabricator.wikimedia.org/T236487 (bd808) The package is in the contrib component upstream: https://packages.debian.org/buster/geoipupdate This hack seemed to work as a temporary workaround: ` $ cat /etc/apt/sources.list.d/debian-co...
[23:01:43] (PS2) BryanDavis: cloud vps: Add support for Buster to LXC module [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455)
[23:03:38] (PS3) BryanDavis: cloud vps: Add support for Buster to LXC module [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455)
[23:14:52] Operations, LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (Bawolff)
[23:29:33] (CR) BryanDavis: [C: +1] cloud vps: Add support for Buster to LXC module [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455) (owner: BryanDavis)
[23:31:11] (CR) BryanDavis: [C: +1] "Tested via cherry-pick on mwv-puppetmaster.mediawiki-vagrant.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/546373 (https://phabricator.wikimedia.org/T236455) (owner: BryanDavis)
[23:33:58] (PS1) Brian Wolff: Fix up CSP headers for doc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/546378 (https://phabricator.wikimedia.org/T213223)