[00:00:22] (CR) Cwhite: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/542251 (owner: Filippo Giunchedi)
[00:11:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:12:30] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[00:13:04] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[00:16:58] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[00:27:42] (PS1) Hoo man: No longer use --no-cache when dumping Wikibase entities [puppet] - https://gerrit.wikimedia.org/r/542278
[00:30:59] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[00:32:09] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:35:51] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn) codfw db hosts - fixed ms-be eqiad hosts - These brand new installs from T232367 @Robh @Jclark-ctr could you make sure IPMI over LAN is enabled on these?
[00:38:15] Operations, Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (Dzahn)
[00:38:18] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[00:38:20] Operations, observability: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160 (Dzahn)
[00:38:40] Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 (Dzahn)
[00:38:45] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[00:39:54] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn) a:Papaul assigning to Papaul per IRC chat (thanks!)
[00:49:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:31:10] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Papaul)
[01:32:11] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Papaul) a:Papaul→Dzahn
[01:32:34] Operations, ops-codfw, DC-Ops, decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Papaul)
[01:32:51] Operations, ops-codfw, DC-Ops, decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Papaul) Open→Resolved Complete
[01:32:53] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Papaul)
[01:33:10] Operations, ops-codfw, DC-Ops, decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (Papaul)
[01:40:39] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:03:28] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn)
[02:10:46] !log gerrit1001 - attempt to manually start replication to github
[02:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:08] !log gerrit - restart service to ensure last config change is picked up
[02:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:32] !log gerrit - "manually" starting replication via ssh command
[02:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:21] Operations, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (Dzahn) a:Dzahn→None
[02:36:38] Operations, ops-eqiad, DC-Ops, decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (Dzahn) The box for production DNS removed is checked but looking at DNS repo it's still there: templates/wikimedia.org:astatine 1H IN A 208.80.155.110 templates/155.80.208.i...
[02:37:37] Operations, ops-eqiad, DC-Ops, decommission: decommission astatine - https://phabricator.wikimedia.org/T221244 (Dzahn)
[02:56:29] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20673816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:58:07] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 51248 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:36:25] (CR) Marostegui: "Now that this is merged, should we remove the IP bans, maybe on Monday?" [puppet] - https://gerrit.wikimedia.org/r/542153 (owner: CDanis)
[04:48:04] (CR) Dzahn: "@akosiaris @cdanis first i saw service/services.yaml and wanted to add a new service name, "httpd", to it. to not use "apache2" again per " [puppet] - https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[04:54:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 for schema change T233625', diff saved to https://phabricator.wikimedia.org/P9310 and previous config saved to /var/cache/conftool/dbconfig/20191011-045409-marostegui.json
[04:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:15] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[05:01:49] Operations, Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (Dzahn) see https://wikitech.wikimedia.org/wiki/Management_Interfaces
[05:13:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:13:41] (PS1) Marostegui: site.pp: Remove db2056 from puppet [puppet] - https://gerrit.wikimedia.org/r/542314 (https://phabricator.wikimedia.org/T230777)
[05:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:42] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2056.codfw.wmnet` - db2056.codfw.wmnet (**PASS**)...
[05:15:06] (PS1) Marostegui: wmnet: Remove db2056 production DNS entries [dns] - https://gerrit.wikimedia.org/r/542316 (https://phabricator.wikimedia.org/T230777)
[05:16:10] (CR) Marostegui: [C: +2] site.pp: Remove db2056 from puppet [puppet] - https://gerrit.wikimedia.org/r/542314 (https://phabricator.wikimedia.org/T230777) (owner: Marostegui)
[05:16:54] (CR) Marostegui: [C: +2] wmnet: Remove db2056 production DNS entries [dns] - https://gerrit.wikimedia.org/r/542316 (https://phabricator.wikimedia.org/T230777) (owner: Marostegui)
[05:18:30] Operations, ops-codfw, DC-Ops, decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (Marostegui) a:RobH→Papaul
[05:18:47] Operations, ops-codfw, DC-Ops, decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (Marostegui) Host ready for onsite steps + switch disablement
[05:23:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:24:53] hello cr2
[05:25:40] (PS1) Vgutierrez: ATS: Add timing request information to ats-tls log [puppet] - https://gerrit.wikimedia.org/r/542317 (https://phabricator.wikimedia.org/T234887)
[05:25:59] seems Telia transport with eqiad down
[05:26:58] |1log rebooting an-conf1001 for serial troubleshooting
[05:27:06] !1log rebooting an-conf1001 for serial troubleshooting
[05:27:59] !log rebooting an-conf1001 for serial troubleshooting
[05:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:43] mmm there was unexpected maintenance but not for that link from what I can read
[05:31:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:28] ah ok and here the other side
[05:33:48] Operations, observability, Availability, Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (Marostegui) The last issue we had with bacula host itself was some sort of storage degradation/failure, no? Maybe some sort of OS monitoring to catch potential issues on the...
[05:34:33] so I don't see any planned maintenance for the circuit, can somebody else triple check? (morning pebcak prevention)
[05:39:04] ah no I found the maintenance
[05:39:06] - Maintenance window:
[05:39:06] Start Date and Time: 2019-Oct-11 04:00 UTC
[05:39:06] End Date and Time: 2019-Oct-11 11:00 UTC
[05:39:27] ok so it was expected but not in any calendar afaics
[05:44:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:47:16] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 93, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:02:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:05:08] again maintenance --^
[06:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3318 for compression - T232446', diff saved to https://phabricator.wikimedia.org/P9311 and previous config saved to /var/cache/conftool/dbconfig/20191011-060814-marostegui.json
[06:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:19] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[06:13:20] !log Compress tables on db2085:3318 - T232446
[06:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:24] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[06:19:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:52] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:36] (CR) Muehlenhoff: [C: +1] admin: add jkumalah to analytics-privatedata-users, researchers [puppet] - https://gerrit.wikimedia.org/r/542141 (https://phabricator.wikimedia.org/T234433) (owner: Herron)
[06:57:30] (CR) Muehlenhoff: [C: +1] admin: add dedcode to analytics-privatedata-users, researchers [puppet] - https://gerrit.wikimedia.org/r/542132 (https://phabricator.wikimedia.org/T234473) (owner: Herron)
[07:05:49] Operations, Analytics, SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (MGerlach) @MoritzMuehlenhoff opening this again since I cannot access the cluster anymore, e.g. via 'ssh mgerlach@stat1007.eqiad.wmnet' This happended aft...
[07:28:31] !log deactivate HE peering on cr1-eqiad for packet loss
[07:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:17] !log deactivate HE peering on cr2-eqord for packet loss
[07:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:00] !log rollback two previous HE peering deactivate
[07:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:32] Operations, ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (MoritzMuehlenhoff)
[07:35:41] Operations, ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (MoritzMuehlenhoff) p:Triage→Normal
[07:45:22] (PS1) Muehlenhoff: Switch labpuppetmaster spares to Buster for microcode/initrd debugging [puppet] - https://gerrit.wikimedia.org/r/542320
[07:45:41] (PS1) Ayounsi: PDUs: add model sentry 4 to eqiad b1 and a2 [puppet] - https://gerrit.wikimedia.org/r/542321 (https://phabricator.wikimedia.org/T227536)
[07:48:48] (PS2) Muehlenhoff: Switch labpuppetmaster spares to Buster for microcode/initrd debugging [puppet] - https://gerrit.wikimedia.org/r/542320
[07:51:06] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/18852/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/542321 (https://phabricator.wikimedia.org/T227536) (owner: Ayounsi)
[07:51:43] (PS3) Muehlenhoff: Switch labpuppetmaster spares to Buster for microcode/initrd debugging [puppet] - https://gerrit.wikimedia.org/r/542320
[07:55:28] (CR) Muehlenhoff: [C: +2] Switch labpuppetmaster spares to Buster for microcode/initrd debugging [puppet] - https://gerrit.wikimedia.org/r/542320 (owner: Muehlenhoff)
[07:55:29] RECOVERY - ps1-b1-eqiad-infeed-load-tower-B-phase-Y on ps1-b1-eqiad is OK: SNMP OK - ps1-b1-eqiad-infeed-load-tower-B-phase-Y 194 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:55:39] RECOVERY - ps1-b1-eqiad-infeed-load-tower-B-phase-Z on ps1-b1-eqiad is OK: SNMP OK - ps1-b1-eqiad-infeed-load-tower-B-phase-Z 318 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:55:47] RECOVERY - ps1-a2-eqiad-infeed-load-tower-B-phase-X on ps1-a2-eqiad is OK: SNMP OK - ps1-a2-eqiad-infeed-load-tower-B-phase-X 367 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:55:55] RECOVERY - ps1-a2-eqiad-infeed-load-tower-A-phase-Y on ps1-a2-eqiad is OK: SNMP OK - ps1-a2-eqiad-infeed-load-tower-A-phase-Y 248 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:56:03] RECOVERY - ps1-b1-eqiad-infeed-load-tower-B-phase-X on ps1-b1-eqiad is OK: SNMP OK - ps1-b1-eqiad-infeed-load-tower-B-phase-X 273 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:57:29] RECOVERY - ps1-a2-eqiad-infeed-load-tower-B-phase-Z on ps1-a2-eqiad is OK: SNMP OK - ps1-a2-eqiad-infeed-load-tower-B-phase-Z 226 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:13] RECOVERY - ps1-b1-eqiad-infeed-load-tower-A-phase-X on ps1-b1-eqiad is OK: SNMP OK - ps1-b1-eqiad-infeed-load-tower-A-phase-X 322 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:13] RECOVERY - ps1-a2-eqiad-infeed-load-tower-A-phase-X on ps1-a2-eqiad is OK: SNMP OK - ps1-a2-eqiad-infeed-load-tower-A-phase-X 293 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:13] RECOVERY - ps1-a2-eqiad-infeed-load-tower-A-phase-Z on ps1-a2-eqiad is OK: SNMP OK - ps1-a2-eqiad-infeed-load-tower-A-phase-Z 321 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:13] RECOVERY - ps1-a2-eqiad-infeed-load-tower-B-phase-Y on ps1-a2-eqiad is OK: SNMP OK - ps1-a2-eqiad-infeed-load-tower-B-phase-Y 318 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:13] RECOVERY - ps1-b1-eqiad-infeed-load-tower-A-phase-Y on ps1-b1-eqiad is OK: SNMP OK - ps1-b1-eqiad-infeed-load-tower-A-phase-Y 199 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:13] RECOVERY - ps1-b1-eqiad-infeed-load-tower-A-phase-Z on ps1-b1-eqiad is OK: SNMP OK - ps1-b1-eqiad-infeed-load-tower-A-phase-Z 346 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:37] !log reimaging labpuppetmaster1002 (spare) for some tests related to microcode loading
[08:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:08] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (ayounsi) Can I suggest a few modifications to the PDU swap checklist of each task? Mostly to clear out the alerting noise Under: "schedule downtime for the entire list of...
[08:18:45] Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (wiki_willy) Hi @ayounsi - I talked to a couple other people who had the same concern the other day, and I agree as well...so I started scheduling downtime for the PDU ale...
[08:28:29] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[08:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:45] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:21] !log remove kafka1001-1003 from debmonitor DB (T235125)
[08:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:24] T235125: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125
[08:34:04] !log remove kafka2001-2003 from debmonitor DB (T235125)
[08:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:36] Operations, ops-eqiad, User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (wiki_willy) I'll dig around a bit and check with Dell to see if we can figure why Com1 and Com2 have to be flipped to get it working. Talked to Luca and wo...
[08:40:57] (CR) Gehel: [C: -1] "see comments inline" (8 comments) [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[08:55:50] (CR) Muehlenhoff: [V: +2 C: +2] Move parsing of Cumin alias/query outside of a global option [debs/debdeploy] - https://gerrit.wikimedia.org/r/542125 (owner: Muehlenhoff)
[09:08:18] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:26] (CR) Gehel: [C: -1] "a few more comments inline" (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[09:17:25] Operations, observability, Availability, Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (akosiaris) > The last issue we had with bacula host itself was some sort of storage degradation/failure, no? Somewhat. A disk in the RAID failed, ending up with the nagios c...
[09:20:23] Operations, serviceops: Increase of varnish-be failed fetches error due to "http format error" - https://phabricator.wikimedia.org/T235254 (jijiki)
[09:22:45] Operations, observability, Availability, Goal: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (jcrespo) @fgiunchedi I will either start with such brainstorming or maybe some the technical, foundation layers first (script for checking automation), please make sure to fe...
[09:24:04] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[09:25:54] Operations, DBA, serviceops, Goal, Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (jcrespo) @akosiaris We have reached an impass. We should: * Run puppet with the new permissions on the current bacula host, fix any issues found. * P...
[09:29:30] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:36:15] (PS1) Muehlenhoff: debdeploy-deploy: Transitions are optional [debs/debdeploy] - https://gerrit.wikimedia.org/r/542331
[09:37:02] (CR) Volans: [V: +2 C: +2] "LGTM, bypassing CI for the sphinx issue with requests." [software/spicerack] - https://gerrit.wikimedia.org/r/542090 (https://phabricator.wikimedia.org/T147074) (owner: Jbond)
[09:37:10] (PS2) Volans: ipmi: The change to subprocess.run() failed to capture stdout [software/spicerack] - https://gerrit.wikimedia.org/r/542090 (https://phabricator.wikimedia.org/T147074) (owner: Jbond)
[09:37:54] (CR) Volans: [V: +2 C: +2] ipmi: The change to subprocess.run() failed to capture stdout [software/spicerack] - https://gerrit.wikimedia.org/r/542090 (https://phabricator.wikimedia.org/T147074) (owner: Jbond)
[09:38:51] (CR) Alexandros Kosiaris: "\o/" [puppet] - https://gerrit.wikimedia.org/r/542153 (owner: CDanis)
[09:38:56] (CR) jenkins-bot: ipmi: The change to subprocess.run() failed to capture stdout [software/spicerack] - https://gerrit.wikimedia.org/r/542090 (https://phabricator.wikimedia.org/T147074) (owner: Jbond)
[09:42:35] (PS1) Elukey: eventlogging::dependencies: add python3 dependencies [puppet] - https://gerrit.wikimedia.org/r/542333 (https://phabricator.wikimedia.org/T233231)
[09:43:15] (PS1) Arturo Borrero Gonzalez: toolforge: aptly: add buster-tools repository [puppet] - https://gerrit.wikimedia.org/r/542334 (https://phabricator.wikimedia.org/T235059)
[09:43:17] (CR) jerkins-bot: [V: -1] eventlogging::dependencies: add python3 dependencies [puppet] - https://gerrit.wikimedia.org/r/542333 (https://phabricator.wikimedia.org/T233231) (owner: Elukey)
[09:47:06] (PS2) Elukey: eventlogging::dependencies: add python3 dependencies [puppet] - https://gerrit.wikimedia.org/r/542333 (https://phabricator.wikimedia.org/T233231)
[09:49:19] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: aptly: add buster-tools repository [puppet] - https://gerrit.wikimedia.org/r/542334 (https://phabricator.wikimedia.org/T235059) (owner: Arturo Borrero Gonzalez)
[09:52:10] (CR) Elukey: [C: +2] eventlogging::dependencies: add python3 dependencies [puppet] - https://gerrit.wikimedia.org/r/542333 (https://phabricator.wikimedia.org/T233231) (owner: Elukey)
[09:52:22] (PS3) Elukey: eventlogging::dependencies: add python3 dependencies [puppet] - https://gerrit.wikimedia.org/r/542333 (https://phabricator.wikimedia.org/T233231)
[09:52:43] (CR) Muehlenhoff: [C: +2] debdeploy-deploy: Transitions are optional [debs/debdeploy] - https://gerrit.wikimedia.org/r/542331 (owner: Muehlenhoff)
[09:56:05] (PS1) Arturo Borrero Gonzalez: toolforge: aptly: add buster-toolsbeta repository [puppet] - https://gerrit.wikimedia.org/r/542348 (https://phabricator.wikimedia.org/T235059)
[09:56:41] gerrit in trouble?
[09:56:43] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?panelId=16&fullscreen&orgId=1
[09:56:47] super slow for me
[09:57:45] hashar: --^
[09:58:47] it doesn't load anymore for me
[09:59:05] threads are climbing
[09:59:07] elukey: looking
[10:02:31] !log gerrit: killed a stall SendEmail thread that was holding a lock
[10:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:34] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: aptly: add buster-toolsbeta repository [puppet] - https://gerrit.wikimedia.org/r/542348 (https://phabricator.wikimedia.org/T235059) (owner: Arturo Borrero Gonzalez)
[10:04:42] thanks!
[10:04:42] fun
[10:04:48] the deadlock is gone for new requests
[10:04:56] but the lock is still held anyway for the other http threads
[10:04:58] :\
[10:05:14] should we restart?
[10:06:05] (PS1) Elukey: eventlogging::dependencies: remove python3-pykafka dependency [puppet] - https://gerrit.wikimedia.org/r/542401 (https://phabricator.wikimedia.org/T233231)
[10:06:24] (CR) Alexandros Kosiaris: "It needs to be a string that is referenced in hieradata/common/lvs/configuration.yaml in the corresponding LVS entry under the conftool st" [puppet] - https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[10:06:48] (PS2) Elukey: eventlogging::dependencies: remove python3-pykafka dependency [puppet] - https://gerrit.wikimedia.org/r/542401 (https://phabricator.wikimedia.org/T233231)
[10:07:25] (CR) Elukey: [C: +2] eventlogging::dependencies: remove python3-pykafka dependency [puppet] - https://gerrit.wikimedia.org/r/542401 (https://phabricator.wikimedia.org/T233231) (owner: Elukey)
[10:08:46] hashar: gerrit is still half usable for me :(
[10:08:57] yes
[10:09:01] going to resrat it
[10:11:15] !log Restarting Gerrit # T224448
[10:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:19] T224448: Gerrit account cache has a faulty reentrant lock causing http/sendemail threads to stall completely - https://phabricator.wikimedia.org/T224448
[10:15:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[10:17:26] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:17:56] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01156 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:18:00] Oh threads problem again?
[10:18:09] !log imported debdeploy 0.0.99.11 for jessie/stretch/buster-wikimedia
[10:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:35] paladox: yes :-\
[10:28:54] Ok :(
[10:31:27] hashar: we’ll need to get https://gerrit-review.googlesource.com/c/gerrit/+/239436 deployed!
[10:33:32] possibly yeah
[10:34:58] what I am wondering is that maybe the deadlock occurs way above. Eg in the thread pool executor
[10:35:09] so that potentially two task ends up locking it for some reason
[10:35:14] and they end up waiting on each other
[10:35:24] with the jvm magically not detecting it :\
[10:38:25] paladox: but maybe we can thread a heap dump (that is going to take a while and be large I guess) and then find a way to debug it
[10:45:22] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003613 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:01:06] Operations, DBA: Switchover s1 primary database master db1067 -> db1083 - 14th Nov 05:00 - 05:30 UTC - https://phabricator.wikimedia.org/T234800 (Johan) In that case, we don't need to take the less efficient way of writing in Tech News, better to contact the wiki directly.
[11:08:21] !log upgrading debdeploy to 0.0.99.11
[11:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:54] Operations, Analytics, SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (MoritzMuehlenhoff) Try running ` ssh-add ~/.ssh/id_ed25519 ` It will ask you for the passphrase of our SSH key. After running doing that, can you retry...
[11:33:23] (PS8) Arturo Borrero Gonzalez: toolforge: introduce new proxy role [puppet] - https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362)
[11:33:30] Urbanecm: you there?
[11:33:39] hauskater: yes, how may I help?
[11:34:55] Urbanecm: any issues with the job queue? Two global renames refusing to start
[11:34:58] both on enwiki
[11:35:15] * Urbanecm is opening logstash
[11:37:36] Haydenb13's rename seems to be done on enwiki, btw
[11:38:00] seems to be temporary issue :)
[11:38:04] both are done on my end
[11:38:09] hauskater:
[11:39:37] * hauskater checks on his
[11:39:55] Yup, both seem now to have completed
[11:40:00] After ~20 minutes
[11:40:02] :)
[11:40:07] Busy queue maybe
[11:46:15] (CR) Arturo Borrero Gonzalez: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) (owner: Arturo Borrero Gonzalez)
[11:51:06] !log installing unzip security updates on stretch
[11:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:24] (PS1) Muehlenhoff: Add library hint for libcaca [puppet] - https://gerrit.wikimedia.org/r/542413
[11:57:42] Operations, Analytics, SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (MGerlach) That solved it. Thanks.
[12:00:43] (CR) Muehlenhoff: [C: +2] Add library hint for libcaca [puppet] - https://gerrit.wikimedia.org/r/542413 (owner: Muehlenhoff)
[12:22:35] (CR) Lucas Werkmeister (WMDE): "We already have a wmgWikibaseRepoEnableRefTabs setting that’s used to enable ref tabs in beta – it would be better to use that setting for" [mediawiki-config] - https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: Mvolz)
[12:24:39] !log push firewall policies to pfw3-codfw - T235074
[12:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:51] (PS1) Muehlenhoff: Sort distros in generate-debdeploy-spec [debs/debdeploy] - https://gerrit.wikimedia.org/r/542417
[12:25:37] (PS10) Lucas Werkmeister (WMDE): Enable reftabs on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: Mvolz)
[12:25:50] !log push firewall policies to pfw3-eqiad - T235074
[12:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:02] !log installing libcaca security updates on stretch
[12:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3317 after schema change T233625', diff saved to https://phabricator.wikimedia.org/P9314 and previous config saved to /var/cache/conftool/dbconfig/20191011-123159-marostegui.json
[12:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:03] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[12:33:38] !log installing gsoap security updates on stretch
[12:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:48] !log installin zsh updates from stretch point release
[12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:59] !log installing libxslt security updates [12:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:44] !log disable SIP ALG on pfw3-codfw - T235150 [12:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:32] !log disable SIP ALG on pfw3-eqiad - T235150 [12:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:23] 10Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (10MoritzMuehlenhoff) [13:01:37] !log installing 4.9.189 Linux update from last stretch point releases (no reboots, deploying the package only at this point) [13:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:40] (03PS9) 10Arturo Borrero Gonzalez: toolforge: introduce new proxy role [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) [13:06:45] (03CR) 10jerkins-bot: [V: 04-1] toolforge: introduce new proxy role [puppet] - 10https://gerrit.wikimedia.org/r/508560 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [13:09:21] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Halfak) [13:10:31] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Halfak) I've updated the task details with some high-level reasoning for the access. If it's not evident, I app... 
[13:17:05] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10kevinbazira) Thanks @Halfak, @herron, I've signed the L3 agreement document, and below is my user information:... [13:42:40] 10Operations, 10serviceops: Increase of varnish-be failed fetches error due to "http format error" - https://phabricator.wikimedia.org/T235254 (10jijiki) The varnish error rate is back to normal for now, but we should keep an eye for a similar issue in the future. [13:57:02] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:57:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] !log rebooting cloudbackup2001 [13:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:06] (03PS3) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) [14:17:19] (03CR) 10jerkins-bot: [V: 04-1] lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) (owner: 10Effie Mouzeli) [14:22:08] (03PS4) 10Effie Mouzeli: lvs::monitor_services: increase number of tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) [14:25:53] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Paladox) [14:29:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs::monitor_services: increase number of 
tries before MCS is critical [puppet] - 10https://gerrit.wikimedia.org/r/541891 (https://phabricator.wikimedia.org/T229286) (owner: 10Effie Mouzeli) [14:29:33] (03CR) 10Volans: [C: 03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542417 (owner: 10Muehlenhoff) [14:31:27] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Sort distros in generate-debdeploy-spec [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542417 (owner: 10Muehlenhoff) [14:38:48] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:39:53] (03PS1) 10Muehlenhoff: Print spec file name in generate-debdeploy-spec [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542446 [14:43:18] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Print spec file name in generate-debdeploy-spec [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/542446 (owner: 10Muehlenhoff) [15:00:02] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:02:26] (03CR) 10Aaron Schulz: "A noop is fine. It means that I change the MW default without breaking prod." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [15:34:51] (03CR) 10Cwhite: [C: 03+2] Update probe endpoint to support path and spec_segment [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/541683 (owner: 10Cwhite) [15:35:30] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:35:52] (03PS1) 10Jhedden: openstack: update eqiad1 clients to wikimediacloud auth url [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) [15:36:26] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [15:36:26] g keys: [image, tfa, mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:26] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [15:36:26] g keys: [tfa, mostread, image] 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:32] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes1002.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:36] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [15:36:36] g keys: [image, mostread, tfa] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:42] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:42] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:04] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:37:10] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} 
(Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:37:10] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restba [15:37:10] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:10] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:30] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restba [15:37:30] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:37:54] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:54] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:58] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:58] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:58] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:58] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed 
content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:38:04] PROBLEM - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:39:44] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:39:44] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:39:48] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:39:54] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:39:54] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:39:54] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:40:46] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:40:51] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/18858/" [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:42:48] shdubsh: o/ can the above alerts be related to your change? 
[15:42:50] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:42:50] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:00] or do we have an outage? [15:43:23] elukey: not related to anything I'm doing. [15:43:40] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:43:45] maybe an outage. checking for effects [15:43:46] what's going on? 
[15:43:58] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [tfa, image, mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:04] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:44:11] not sure, there was a change from cole about the swagger prometheus stuff [15:44:20] so I thought it was related [15:44:23] https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-swagger-exporter/+/541683/ [15:44:24] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:24] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:26] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:26] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:26] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:26] RECOVERY - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 945 bytes in 0.002 second response time 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:44:28] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:40] (03CR) 10Jhedden: "openstack..wikimediacloud.org is a CNAME to cloudcontrol1003, but it will restart a lot of services." [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:44:42] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:44:46] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:46] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:56] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:56] PROBLEM - 
restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:56] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:09] marostegui: I think we should ping more people [15:45:14] (03CR) 10Jhedden: [C: 03+1] "On hold until Tuesday, Oct 15 2019" [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:45:18] mobrovac: --^ [15:45:22] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:45:24] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restba [15:45:34] (03CR) 10Jhedden: [C: 04-1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:45:36] elukey: done [15:45:55] entrypoint latencies are multi-minute [15:46:48] seems only for one endpoint though [15:46:58] (03PS2) 10Jhedden: openstack: update eqiad1 clients to wikimediacloud auth url [puppet] - 10https://gerrit.wikimedia.org/r/542452 
(https://phabricator.wikimedia.org/T223907) [15:47:40] the /en.wikipedia.org/v1/feed/featured [15:48:02] yup that's wikifeeds acting up [15:48:03] akosiaris: ^ [15:48:28] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:48:34] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:34] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:38] aha! [15:49:16] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [tfa, mostread, image] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:18] PROBLEM - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.47 and port 8889: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:49:22] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:22] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:22] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:22] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restba [15:49:22] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restba [15:49:26] yeah no [15:49:32] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:32] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:34] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:42] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [15:49:42] most-read articles for January 1, 2016 (with 
aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [15:49:46] akosiaris: ^ i get conn refused if i try manually wikifeeds from a rb host [15:49:46] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:48] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:16] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:30] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:50:34] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:35] ok seems to be back to normal [15:50:54] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:54] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:56] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:56] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:56] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:56] RECOVERY - 
restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:56] RECOVERY - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 945 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:51:16] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:51:22] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:51:50] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:59] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10jrobell) Thanks for your help moving this forward @herron. would it be possible to get on a call or chat with @EYener and @jkumalah to make su... [15:54:31] mobrovac: is wikifeeds on k8s or elsewhere? [15:54:34] (super ignorant) [15:54:44] k8s elukey [15:54:51] ah lovely [16:05:46] do I understand we were in an outage for a bit? 
because I see no pages even by email [16:12:56] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10Aklapper) As long as the list description explains, the actual name can be short I guess :) [16:16:00] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:19:58] PROBLEM - Host ganeti2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:36] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:32:00] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:32:12] I'll take a look at ganeti2009 [16:32:22] RECOVERY - Host ganeti2009 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [16:32:41] oh nevermind that's being setup isn't it ? [16:32:52] papaul: ^ ? [16:33:36] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 78.3 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:42:18] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10wiki_willy) @Jclark-ctr - this arrived Thursday via https://www.fedex.com/en-us/home.html. Just a heads up, this will need to be replaced before the PDU upgrade next Tuesday, to retain redundan... 
[16:43:48] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Marostegui) I believe we don't have to put the host down for the PSU replacement, do we? However, I would like to depool and stop mysql beforehand, as a crash with mysql running could cause data corr... [16:45:22] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10wiki_willy) Yup, it should be a hot swap. So @Jclark-ctr - please reach out to @Marostegui before replacing it. Thanks, Willy [16:48:51] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Papaul) @elukey after working 4 hours on this, the problem ended up not being the serial configuration in the BIOS but the GRUB settings. On the systems we hav... [16:54:17] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Papaul) I made the change again on an-conf1001, ran systemctl enable getty@ttyS1, and rebooted the system; now it is working, so you can do the same for t... [16:55:57] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10wiki_willy) Great job @Papaul in troubleshooting this and tracking it down to the root cause. Thanks! ~Willy [17:06:54] (03PS2) 10Herron: admin: add dedcode to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/542132 (https://phabricator.wikimedia.org/T234473) [17:09:13] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) An estimated 120 emails have now been unsubscribed. It looks like AOL and Yahoo. Is this also happening for other mailing lists? 
[17:09:22] (03PS1) 10Jgreen: add frqueue2001 to icinga nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/542464 (https://phabricator.wikimedia.org/T232630) [17:09:58] (03CR) 10Herron: [C: 03+2] admin: add dedcode to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/542132 (https://phabricator.wikimedia.org/T234473) (owner: 10Herron) [17:10:36] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Paladox) When i sent a email to wikitech-i last night, it failed (seems to be because yahoo blacklisted lists.wikimedia.org. [17:11:49] (03PS2) 10Herron: admin: add jkumalah to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/542141 (https://phabricator.wikimedia.org/T234433) [17:14:13] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [17:14:16] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frban1001.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [17:14:19] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001 - https://phabricator.wikimedia.org/T232137 (10Jgreen) [17:14:22] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Jgreen) [17:16:45] (03CR) 10Herron: [C: 03+2] admin: add jkumalah to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/542141 (https://phabricator.wikimedia.org/T234433) (owner: 10Herron) [17:18:12] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10herron) [17:18:27] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frqueue2001 - 
https://phabricator.wikimedia.org/T232630 (10Jgreen) [17:27:01] (03PS1) 10Cwhite: profile: added swagger exporter jobs at svc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:28:10] (03PS2) 10Jgreen: add frqueue2001 to icinga nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/542464 (https://phabricator.wikimedia.org/T232630) [17:29:41] (03CR) 10jerkins-bot: [V: 04-1] profile: added swagger exporter jobs at svc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:29:54] (03CR) 10Jgreen: [C: 03+2] add frqueue2001 to icinga nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/542464 (https://phabricator.wikimedia.org/T232630) (owner: 10Jgreen) [17:31:11] 10Operations, 10ops-eqiad, 10media-storage, 10User-fgiunchedi: ms-be1020 - host went down - https://phabricator.wikimedia.org/T234698 (10Dzahn) [17:31:47] (03PS2) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [17:31:48] 10Operations, 10ops-eqiad, 10media-storage, 10User-fgiunchedi: ms-be1020 - firmware upgrade: (was: host went down) - https://phabricator.wikimedia.org/T234698 (10Dzahn) [17:32:22] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:33:04] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Lea_Lacroix_WMDE) I'm wondering if this is somehow related to the massive spam attack we had a few months ago on some mailing-lists (hundred of //fake// AOL email addresses subscri... 
[17:35:00] (03PS2) 10Cwhite: profile: added swagger exporter jobs at svc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [17:35:03] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Jgreen) [x] bonded ethernet configuration done [x] redis replication appears to be working now that firewall policy is deployed [x] added to icinga [17:38:23] (03CR) 10Umherirrender: [C: 04-1] "Yes, it looks good on wiki. Lets work with the messages first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [17:44:45] (03CR) 10Krinkle: [C: 03+1] scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T234283) (owner: 10Thcipriani) [17:49:05] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Papaul) @elukey so the issue is the you used puppet/modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 and not puppet/modules/install_... [17:50:07] (03PS16) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T233089) [17:50:50] 10Operations, 10Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (10Dzahn) @RobH there is a wikitech page you made back in 2012 about the ipmi_mgmt script at https://wikitech.wikimedia.org/wiki/Systems_management. Is that still used? Would it make sense... 
[17:52:03] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) p:05Triage→03High [17:55:12] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) [17:57:03] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Aklapper) p:05High→03Normal >>! In T232417#5566947, @Effeietsanders wrote: > An estimated 120 emails have now been unsubscribed. It looks like AOL and Yahoo. Is this also happe... [17:57:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah) @herron or @Nuria when i ssh into the stat1007 my password does not seem to work. I tried my yubikey as wel... [17:58:45] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Paladox) Is this normal though? Having lists.wikimedia.org blocked by yahoo in my opinion is pretty high priority. [17:59:25] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) Thanks a lot for this work! I was not aware that mistake, and I have also to admit my ignorance about that part of the dhcp configuration. I think t... 
[18:02:07] 10Operations, 10MediaWiki-General, 10serviceops, 10CPT Initiatives (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Krinkle) [18:02:11] 10Operations, 10serviceops, 10Patch-For-Review: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Krinkle) [18:02:19] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) The info is in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Preparation_2 so I have clearly missed it, but a reference in the FAQ of platform... [18:04:10] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Papaul) @elukey no need to feel bad about a mistake; we all make mistakes. Just glad that it is fixed. [18:07:31] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod-mgt@ mailing list - https://phabricator.wikimedia.org/T235291 (10greg) p:05Triage→03Normal [18:08:13] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10aezell) I spoke to a friend who still works in this area and they said that spam detection and management is in freefall at Yahoo/AOL right now. They are rapidly defunding that par... [18:09:12] (03PS1) 10BryanDavis: wikitech: Update hostnames for OpenStack endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542506 (https://phabricator.wikimedia.org/T223907) [18:10:04] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) >>! In T227025#5567184, @Papaul wrote: > @elukey no need to feel bad about a mistake; we all make mistakes. Just glad that it is fixed. Thanks!
What I... [18:11:48] 10Operations, 10Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (10Dzahn) @ema @srodlund > Wikitech has the following list of IPMI related pages: .. - https://wikitech.wikimedia.org/wiki/Systems_management pinged author in comment above - https://wiki... [18:12:59] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Papaul) I have no problem with you expanding the documentation : ) [18:19:16] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod-mgt@ mailing list - https://phabricator.wikimedia.org/T235291 (10MarcoAurelio) @greg Will the list be public or private, with or without archives? Thanks. [18:25:17] (03PS1) 10Jforrester: build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 [18:27:22] (03CR) 10Jhedden: [C: 03+1] wikitech: Update hostnames for OpenStack endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542506 (https://phabricator.wikimedia.org/T223907) (owner: 10BryanDavis) [18:30:37] (03CR) 10jerkins-bot: [V: 04-1] build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 (owner: 10Jforrester) [18:36:27] 10Operations, 10Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (10Dzahn) - https://wikitech.wikimedia.org/wiki/Systems_management redirected to [[https://wikitech.wikimedia.org/wiki/Management_Interfaces | Management Interfaces]] [18:43:16] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10Nuria) Your ssh key is the one that should work, but it should not require a password. The machine's full name is stat1007.eqiad.wmnet so > s...
[18:48:52] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod-mgt@ mailing list - https://phabricator.wikimedia.org/T235291 (10greg) Private, with archives. [18:58:40] (03PS7) 10Dzahn: conftool/LVS: add new service parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [19:07:28] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod-mgt@ mailing list - https://phabricator.wikimedia.org/T235291 (10Dzahn) @greg List created. I let it created a random pass, then added the secondary admins and ran a "reset password" command. So you should have received 2 mails and everybody else... [19:11:04] @seen hauskater [19:11:04] mutante: Last time I saw hauskater they were quitting the network with reason: Quit: hauskater N/A at 10/11/2019 6:38:13 PM (32m51s ago) [19:11:33] (03CR) 10Filippo Giunchedi: [C: 04-1] "The idea LGTM, but we'll have to DRY and base the target discovery on puppet resources for the services themselves." [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [19:12:02] 10Operations, 10Wikimedia-Mailing-lists, 10User-greg: Please create engprod-mgt@ mailing list - https://phabricator.wikimedia.org/T235291 (10Dzahn) a:03greg [19:14:30] (03PS1) 10Herron: logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) [19:18:31] 10Operations, 10Wikimedia-Mailing-lists, 10Wikispore: Creation of Wikispore mailing list - https://phabricator.wikimedia.org/T232961 (10Dzahn) 05Open→03Resolved a:03Dzahn List has been created list info page: https://lists.wikimedia.org/mailman/listinfo/wikispore admin login: https://lists.wikimedia.o... 
[19:19:23] (03CR) 10Krinkle: [C: 03+1] logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [19:21:00] (03CR) 10Filippo Giunchedi: "LGTM! See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [19:21:17] 10Operations, 10Wikimedia-Mailing-lists, 10User-greg: Please create engprod-mgt@ mailing list - https://phabricator.wikimedia.org/T235291 (10greg) 05Open→03Resolved Thanks! done. [19:21:53] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [19:25:15] (03CR) 10Herron: "that was quick! awesome. I'll plan to get this rolled out on tuesday, since we're about to go into a long us holiday weekend." [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [19:25:37] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/4; [edit interfaces interface-range disabled] mem... 
[19:26:04] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Papaul) [19:27:57] (03PS1) 10Dzahn: add service records for new parsoid-php service [dns] - 10https://gerrit.wikimedia.org/r/542566 (https://phabricator.wikimedia.org/T233654) [19:28:24] (03CR) 10Dzahn: "DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/542566" [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:29:11] (03CR) 10Dzahn: [C: 04-2] "per comment in "discovery-geo-resources" do NOT merge before separate change to hieradata/common/discovery.yaml" [dns] - 10https://gerrit.wikimedia.org/r/542566 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:32:51] (03PS1) 10Dzahn: discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) [19:34:56] (03CR) 10Dzahn: "so it looks like first i have to do https://gerrit.wikimedia.org/r/c/operations/puppet/+/542572 and then the DNS change above and then i " [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [19:40:55] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Papaul) [19:42:35] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (10Papaul) [19:43:46] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:46:57] (03PS13) 10Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 [19:49:05] (03CR) 
10jerkins-bot: [V: 04-1] puppetmaster/configmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [19:50:05] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) OIT reports E-mail account has been created. We can start now with some of these. [19:55:32] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10Papaul) @MoritzMuehlenhoff The system is running : BIOS version :2.01 /available BIOS version: 2.10 Firmware version: 2.30 /available Firmware version:2.63 [19:56:13] 10Operations, 10MediaWiki-General, 10serviceops, 10CPT Initiatives (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Gorobay) Many articles beginning with lowercase letters are redirects to arti... [19:56:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:00:24] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10herron) >! In T232417#5567208, @aezell wrote: > tl:dr; Contacting someone in the abuse department at Yahoo/AOL is probably the best bet to figure this out. Yes indeed this looks t... [20:06:09] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2057 and db2063 [dns] - 10https://gerrit.wikimedia.org/r/542597 [20:08:33] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10jkumalah) {F30630945} Will follow-up with fr-tech teammates. 
The attached image is what i get each time. [20:09:27] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for db2057 and db2063 [dns] - 10https://gerrit.wikimedia.org/r/542597 (owner: 10Papaul) [20:11:46] (03PS3) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [20:12:01] (03CR) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [20:13:45] (03CR) 10jerkins-bot: [V: 04-1] profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [20:18:23] 10Operations, 10MediaWiki-General, 10serviceops, 10CPT Initiatives (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Anomie) And in some cases the actual article is at the lowercase-letter title... [20:18:55] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2057 and db2063 [dns] - 10https://gerrit.wikimedia.org/r/542597 (owner: 10Papaul) [20:22:04] (03PS1) 10Herron: admin: add eyener to researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/542599 (https://phabricator.wikimedia.org/T234529) [20:23:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10herron) Hi @Nuria could you please review this group request for approval? 
[20:27:16] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Papaul) [20:27:25] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [20:27:28] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Papaul) 05Open→03Resolved Complete [20:28:04] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (10Papaul) [20:28:07] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10herron) Regarding chat I'd encourage them to reach out with any questions via IRC. Details about available channels and... [20:28:16] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Papaul) [20:28:18] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2063.codfw.wmnet - https://phabricator.wikimedia.org/T230704 (10Papaul) 05Open→03Resolved Complete [20:29:10] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10herron) 05Open→03Resolved a:03herron Access has been granted. Transitioning this to resolved now, but if any follow-up is needed please don't hesi... [20:29:34] 10Operations, 10DNS, 10Toolforge, 10Traffic, 10cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (10Krenair) This will need to be communicated to MarkMonitor who register domains on WMF's behalf..... 
[20:32:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10herron) Great, thank you! @Nuria could you please review/approve for analytics groups? @greg could you please... [20:33:19] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10herron) [20:38:46] (03PS2) 10Jforrester: build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542522 [20:40:30] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10RobH) [20:41:36] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) [20:41:47] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10RobH) [20:41:57] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) [20:42:08] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) - https://phabricator.wikimedia.org/T227542 (10RobH) [20:42:32] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10RobH) [20:46:53] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) Thanks @Lea_Lacroix_WMDE - I didn't look thoroughly enough at the set of people being affected to recognize this pattern and wasn't aware of this issue at other lis... 
[21:22:51] (03PS1) 10Jhedden: openstack: Allow tools-dns-manager to connect from labs networks [puppet] - 10https://gerrit.wikimedia.org/r/542605 (https://phabricator.wikimedia.org/T235304) [21:24:38] (03CR) 10BryanDavis: [C: 03+1] openstack: Allow tools-dns-manager to connect from labs networks [puppet] - 10https://gerrit.wikimedia.org/r/542605 (https://phabricator.wikimedia.org/T235304) (owner: 10Jhedden) [21:26:18] (03CR) 10Jhedden: [C: 03+2] openstack: Allow tools-dns-manager to connect from labs networks [puppet] - 10https://gerrit.wikimedia.org/r/542605 (https://phabricator.wikimedia.org/T235304) (owner: 10Jhedden) [21:26:23] (03CR) 10Alex Monk: [C: 03+1] openstack: Allow tools-dns-manager to connect from labs networks [puppet] - 10https://gerrit.wikimedia.org/r/542605 (https://phabricator.wikimedia.org/T235304) (owner: 10Jhedden) [21:52:20] PROBLEM - novaadmin has roles in every project on cloudcontrol1003 is CRITICAL: In tools, user novaadmin should have roles [user, projectadmin] but has [udesignateadmin, uprojectadmin, uuser, uadmin] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:55:34] RECOVERY - novaadmin has roles in every project on cloudcontrol1003 is OK: novaadmin has the correct roles in all projects. 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:18:33] (03PS1) 10Bstorm: keystone: change monitoring some details to email rather than paging [puppet] - 10https://gerrit.wikimedia.org/r/542610 [22:28:08] PROBLEM - IPMI Sensor Status on maps1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [22:31:24] (03CR) 10Filippo Giunchedi: "> Patch Set 3: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [22:31:35] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [22:32:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [22:33:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (10Jclark-ctr) [22:34:19] (03PS1) 10Papaul: DNS: Remove mgmt DNS for phab1002, astatine and production DNS for astatine [dns] - 10https://gerrit.wikimedia.org/r/542613 [22:35:34] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Jclark-ctr) [22:36:05] (03PS1) 10Groceryheist: update ssh key for nathante [puppet] - 10https://gerrit.wikimedia.org/r/542614 [22:36:07] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/542614 (owner: 10Groceryheist) [22:36:12] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for phab1002, astatine and production DNS for astatine [dns] - 10https://gerrit.wikimedia.org/r/542613 (owner: 10Papaul) [22:37:02] (03Abandoned) 10Groceryheist: update ssh key for nathante [puppet] - 10https://gerrit.wikimedia.org/r/542614 (owner: 10Groceryheist) [22:39:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission phab1002/WMF4727 - https://phabricator.wikimedia.org/T221391 (10Papaul) 05Open→03Resolved Complete [22:39:55] 10Operations, 10serviceops: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 (10Papaul) [22:40:00] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Papaul) [22:40:04] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002 - https://phabricator.wikimedia.org/T195623 (10Papaul) [22:40:09] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Papaul) [22:40:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission astatine - https://phabricator.wikimedia.org/T221244 (10Papaul) [22:40:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (10Papaul) [22:41:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission astatine - https://phabricator.wikimedia.org/T221244 (10Papaul) 05Open→03Resolved Complete [22:41:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: 
decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T226715 (10Jclark-ctr) [22:42:45] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Jclark-ctr) [22:43:33] 10Operations, 10ops-eqiad, 10decommission: Decommission labcontrol1001 & labcontrol1002 - https://phabricator.wikimedia.org/T221817 (10Jclark-ctr) [22:44:51] 10Operations, 10ops-eqiad, 10decommission, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Jclark-ctr) [22:47:53] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Jclark-ctr) [22:48:32] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10Jclark-ctr) [22:55:21] (03PS1) 10Groceryheist: Update ssh key for nathante (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/542618 [22:58:29] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) [22:59:54] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:00:47] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) [23:03:20] (03Abandoned) 10Groceryheist: Update ssh key for nathante (try 2) [puppet] - 10https://gerrit.wikimedia.org/r/542618 (owner: 10Groceryheist) [23:06:53] 10Operations, 10ops-eqiad, 10DC-Ops, 
10Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Jclark-ctr) [23:08:04] (03PS1) 10Groceryheist: update ssh key for nathante [puppet] - 10https://gerrit.wikimedia.org/r/542621 [23:08:49] I need to update my ssh key [23:08:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/542621 [23:10:30] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:16:08] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) ` papaul@asw2-b-eqiad# show | compare [edit interfaces] - ge-5/0/12 { - description dbproxy1004; - } [23:17:11] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10decommission: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) [23:22:42] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:23:01] (03PS1) 10Papaul: DNS: Remove DNS for dbproxy1004 and dbproxy1009 [dns] - 10https://gerrit.wikimedia.org/r/542623 [23:25:52] (03CR) 10Papaul: [C: 03+2] DNS: Remove DNS for dbproxy1004 and dbproxy1009 [dns] - 10https://gerrit.wikimedia.org/r/542623 (owner: 10Papaul) [23:27:36] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, and 2 others: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) [23:29:29] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, and 2 others: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (10Papaul) @Jclark-ctr 
once you add dbproxy1009 to the decom Sheet, you can resolve the task. Thanks [23:37:28] (03CR) 10Cwhite: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:42:05] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: labsdb1009 broken PSU - https://phabricator.wikimedia.org/T233273 (10Jclark-ctr) @Marostegui Received PSU. would like to replace Monday [23:43:56] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:48:57] (03PS2) 10Umherirrender: Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) [23:49:18] (03CR) 10Umherirrender: "Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: 10Umherirrender) [23:49:28] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10Jclark-ctr) 05Open→03Resolved updated ps2-a2-eqiad and location set to active. [23:49:29] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10Jclark-ctr)