[00:11:07] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:11:41] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@8f6f660]: 0.3.41
[00:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:52] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@8f6f660]: 0.3.41 (duration: 15m 10s)
[00:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:46] (CR) Cicalese: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) (owner: Cicalese)
[00:31:00] (PS5) Cicalese: DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T245595)
[00:33:23] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:45:41] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw2236 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:45:51] RECOVERY - Check systemd state on mw2236 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:47] Operations, PDF-Rendering, Product-Infrastructure-Team-Backlog, Proton, and 3 others: PDF renderer needs better CJK font - https://phabricator.wikimedia.org/T226633 (Shizhao)
[02:49:07] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:00:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4031 is OK: HTTP OK: HTTP/1.0 200 OK - 23484 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:00:52] (PS1) Andrew Bogott: codfw1dev: switch to an in-cloud puppetmaster [puppet] - https://gerrit.wikimedia.org/r/612697 (https://phabricator.wikimedia.org/T242607)
[04:03:12] (CR) Andrew Bogott: [C: +2] codfw1dev: switch to an in-cloud puppetmaster [puppet] - https://gerrit.wikimedia.org/r/612697 (https://phabricator.wikimedia.org/T242607) (owner: Andrew Bogott)
[04:06:32] Operations, ops-eqiad, DBA: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257983 (Marostegui) Open→Invalid
[04:06:35] Operations, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui)
[04:14:46] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:22:08] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:31:37] Operations, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (Marostegui) @Jclark-ctr everything done from your side? I see the host is back up. What was done in the end?
[04:36:30] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:41:16] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1135', diff saved to https://phabricator.wikimedia.org/P11907 and previous config saved to /var/cache/conftool/dbconfig/20200715-044332-marostegui.json
[04:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1103 weight to 0 before the switchover T254871', diff saved to https://phabricator.wikimedia.org/P11908 and previous config saved to /var/cache/conftool/dbconfig/20200715-044432-marostegui.json
[04:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:37] T254871: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871
[04:46:25] !log Start x1 pre failover steps T254871
[04:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:42] (CR) Marostegui: mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/612474 (https://phabricator.wikimedia.org/T254871) (owner: Marostegui)
[04:51:44] (CR) Marostegui: [C: +2] mariadb: Promote db1103 to x1 master [puppet] - https://gerrit.wikimedia.org/r/612474 (https://phabricator.wikimedia.org/T254871) (owner: Marostegui)
[05:05:49] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[05:06:13] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[05:21:06] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11454 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[05:29:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 ', diff saved to https://phabricator.wikimedia.org/P11909 and previous config saved to /var/cache/conftool/dbconfig/20200715-052939-marostegui.json
[05:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:08] Operations, ops-eqiad: Interface errors on asw2-b-eqiad:ge-5/0/35 (kubernetes1010) - https://phabricator.wikimedia.org/T257542 (ayounsi) Yep, same as usual (eg. : T250257) bad port or cable.
[05:38:12] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:41:52] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:46:39] In 15 minutes we'll switchover x1's master
[05:57:02] Operations, ops-eqiad, DC-Ops, observability: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (ayounsi) Monitoring is alerting for `ps1-c8-eqiad` https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-c8-eqiad And https://librenms.wikimedia.or...
[05:57:32] ACKNOWLEDGEMENT - ps1-c8-eqiad-infeed-load-tower-A-phase-X on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T257871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:32] ACKNOWLEDGEMENT - ps1-c8-eqiad-infeed-load-tower-A-phase-Y on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T257871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:32] ACKNOWLEDGEMENT - ps1-c8-eqiad-infeed-load-tower-A-phase-Z on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T257871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:32] ACKNOWLEDGEMENT - ps1-c8-eqiad-infeed-load-tower-B-phase-X on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T257871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:32] ACKNOWLEDGEMENT - ps1-c8-eqiad-infeed-load-tower-B-phase-Y on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T257871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:32] ACKNOWLEDGEMENT - ps1-c8-eqiad-infeed-load-tower-B-phase-Z on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T257871 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:04] marostegui, jynus, and kormat: That opportune time is upon us again. Time for a x1 database master failover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T0600).
[06:00:10] o/
[06:00:12] ok to start?
[06:00:34] ok
[06:00:44] should just take a few seconds
[06:00:45] no objection here
[06:00:52] !log Starting x1 failover from db1120 to db1103 - T254871
[06:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:57] T254871: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871
[06:01:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1103 to x1 master T254871', diff saved to https://phabricator.wikimedia.org/P11910 and previous config saved to /var/cache/conftool/dbconfig/20200715-060145-marostegui.json
[06:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:51] all done
[06:02:08] I will send you a pm on wiki to test
[06:02:14] ok thanks
[06:03:06] the master's binlog looking good
[06:03:18] did you receive a notification?
[06:03:21] on enwiki?
[06:03:22] yep
[06:03:27] test x1 failover -- This email was sent by user "JCrespo (WMF)" on the English Wikipedia to user "MArostegui (WMF)". It...
[06:04:39] a few readonly errors as expected
[06:05:01] 69 of them
[06:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 for reimage T254871', diff saved to https://phabricator.wikimedia.org/P11911 and previous config saved to /var/cache/conftool/dbconfig/20200715-060649-marostegui.json
[06:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:55] T254871: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871
[06:07:47] (CR) Marostegui: [V: +2 C: +2] wmnet: Update x1 alias [dns] - https://gerrit.wikimedia.org/r/612475 (https://phabricator.wikimedia.org/T254871) (owner: Marostegui)
[06:09:14] !log Stop replication on db1120 to avoid having 10.4 -> 10.1 replication for long T254871
[06:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:20] Operations, netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (ayounsi) p:Triage→Medium
[06:14:29] (PS1) Marostegui: db1120: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/612701 (https://phabricator.wikimedia.org/T254871)
[06:14:47] ACKNOWLEDGEMENT - Host ripe-atlas-eqiad IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:202:208:80:155:69) ayounsi https://phabricator.wikimedia.org/T258018
[06:15:38] (CR) Marostegui: [C: +2] db1120: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/612701 (https://phabricator.wikimedia.org/T254871) (owner: Marostegui)
[06:21:05] (PS1) Marostegui: install_server: Reimage db1120 to Buster [puppet] - https://gerrit.wikimedia.org/r/612702 (https://phabricator.wikimedia.org/T254871)
[06:22:05] (CR) Marostegui: [C: +2] install_server: Reimage db1120 to Buster [puppet] - https://gerrit.wikimedia.org/r/612702 (https://phabricator.wikimedia.org/T254871) (owner: Marostegui)
[06:40:45] (PS1) Privacybatm: Firewall.py: Provide an absolute path to commands and refactor a function [software/transferpy] - https://gerrit.wikimedia.org/r/612705 (https://phabricator.wikimedia.org/T257600)
[06:40:47] (PS1) Privacybatm: Firewall.py: Add unit tests [software/transferpy] - https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600)
[06:59:23] !log depooling wdqs1006 (high lag)
[06:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:58] (Abandoned) DCausse: Setup RDF configuration for Commons Beta with correct prefixes [mediawiki-config] - https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) (owner: Smalyshev)
[07:04:47] (CR) Muehlenhoff: "You need to enable backports via an apt hook, so e.g. "DIST=buster BACKPORTS=yes pdebuild", did you do that? And does it strictly need Go " [debs/grafana-loki] (debian/sid) - https://gerrit.wikimedia.org/r/610864 (owner: Cwhite)
[07:05:40] (CR) Muehlenhoff: [C: +2] Remove apt pin for stretch-backports for npm [puppet] - https://gerrit.wikimedia.org/r/612568 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[07:10:32] (PS1) Muehlenhoff: testreduce: Fix syntax for ensure_packages() [puppet] - https://gerrit.wikimedia.org/r/612820
[07:12:16] (CR) Muehlenhoff: [C: +2] testreduce: Fix syntax for ensure_packages() [puppet] - https://gerrit.wikimedia.org/r/612820 (owner: Muehlenhoff)
[07:12:50] Operations, netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (ayounsi) > I have heard from the RIPE NCC, they are going to attempt to upgrade our eqiad anchor in place, it may be down for a few days
[07:15:42] Operations, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (MoritzMuehlenhoff)
[07:17:24] Operations, DBA, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[07:18:03] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:23:53] Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (akosiaris) >>! In T257903#6306470, @wiki_willy wrote: > @akosiaris - looks like this server is past the 5yr server life cycle, and was due to be refreshed via T231255. Let us know if we can ignore this alert. Tha...
[07:24:04] Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (akosiaris) p:Triage→Low
[07:24:41] (PS3) Kormat: mysql: Add unit tests. [software/spicerack] - https://gerrit.wikimedia.org/r/610282 (https://phabricator.wikimedia.org/T255409)
[07:25:51] Operations, ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (akosiaris)
[07:26:30] Operations, ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (akosiaris) Resolved→Open Re-opening. This has been wrongly closed, the last 2 items in the check list have not been completed.
[07:27:35] Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (akosiaris) Handling this now as part of T243112
[07:28:18] Operations, ops-codfw, serviceops: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (akosiaris)
[07:29:44] !log delete deprecated AS3209 AMS-IX router
[07:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:39] (PS2) Muehlenhoff: role: port netmon to Buster [puppet] - https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[07:31:30] (CR) jerkins-bot: [V: -1] role: port netmon to Buster [puppet] - https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[07:31:34] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/610282 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[07:32:43] Operations, ops-eqiad: Interface errors on asw2-b-eqiad:ge-5/0/35 (kubernetes1010) - https://phabricator.wikimedia.org/T257542 (akosiaris) kubernetes1010 has been drained and depooled. Feel free to conduct any debugging needed.
[07:35:22] (CR) Kormat: [C: +2] mysql: Add unit tests. [software/spicerack] - https://gerrit.wikimedia.org/r/610282 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[07:38:04] (PS1) Elukey: profile::analytics::database::meta: enable TLS [puppet] - https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412)
[07:38:38] (Merged) jenkins-bot: mysql: Add unit tests. [software/spicerack] - https://gerrit.wikimedia.org/r/610282 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[07:39:00] \o/
[07:39:05] \o/
[07:39:15] volans: trying to re-learn how to mock stuff in python was painful :)
[07:40:08] lol, yes in general it is
[07:40:14] (CR) Elukey: [C: -1] "Of course I didn't create those files in puppet, self -1" [puppet] - https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[07:41:54] Operations, CAS-SSO, Patch-For-Review, User-jbond: mod_auth_cas segfaulting on netmon - https://phabricator.wikimedia.org/T257587 (ayounsi) No blockers afaik, LibreNMS seems to be tested on Buster: https://docs.librenms.org/Installation/
[07:41:55] (PS2) Elukey: profile::analytics::database::meta: enable TLS [puppet] - https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412)
[07:43:46] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:44:45] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/23892/an-coord1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/612821 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[07:44:49] Operations, DBA, SRE-tools, Patch-For-Review, User-Kormat: Add native mysql module to spicerack - https://phabricator.wikimedia.org/T255409 (Kormat) Open→Resolved Ok, as the basic module is now in place (with unit tests!), i'm going to close this task in favour of smaller-scoped ones...
[07:50:18] Operations, netops, observability: LibreNMS monitoring glitch caused paging - https://phabricator.wikimedia.org/T252630 (ayounsi) I added an extra condition to the inbound and outbound alerts to trigger: usage needs to be < 150%. Which is way below the crazy % we saw when the bug happens. This //sh...
[07:56:11] Operations, SRE-tools: Improve sre.hosts.decommission (additionally find host yaml files) - https://phabricator.wikimedia.org/T257297 (elukey) Yes I think that matching the IP and hostname more strictly is a good idea, when I saw the issue I tried to see if it was a quick patch for the cookbook but it di...
[07:57:58] (CR) Giuseppe Lavagetto: [C: +1] Don't build and run the tests when compiling Envoy, just the binary. [debs/envoyproxy] - https://gerrit.wikimedia.org/r/612675 (https://phabricator.wikimedia.org/T256843) (owner: RLazarus)
[08:02:08] (PS3) Muehlenhoff: role: port netmon to Buster [puppet] - https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:08:06] (CR) Muehlenhoff: [C: +2] Switch matomo to CAS [puppet] - https://gerrit.wikimedia.org/r/612512 (https://phabricator.wikimedia.org/T159584) (owner: Muehlenhoff)
[08:11:46] (PS1) Muehlenhoff: Add IDP service definition for piwik [puppet] - https://gerrit.wikimedia.org/r/612822
[08:12:25] (CR) Muehlenhoff: [C: +2] Add IDP service definition for piwik [puppet] - https://gerrit.wikimedia.org/r/612822 (owner: Muehlenhoff)
[08:14:25] PROBLEM - piwik.wikimedia.org on matomo1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 302 Found https://wikitech.wikimedia.org/wiki/Analytics/Systems/Piwik
[08:14:42] this is wip --^
[08:16:52] yeah, I'll trigger a puppet run on icinga1001, which should fix it
[08:19:14] !log move piwik.wikimedia.org to CAS (idp.wikimedia.org)
[08:19:24] !log piwik.wikimedia.org switched to CAS authentication
[08:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:34] hahahaha
[08:19:36] sorry!
[08:19:36] ahaha, all great minds think alike :-)
[08:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:56] or as the German proverb goes "Zwei Dumme, ein Gedanke" (two fools, same thought) :-)
[08:20:05] (PS1) Ema: ATS: local debugging rule for traffic-cache-atstext [puppet] - https://gerrit.wikimedia.org/r/612823 (https://phabricator.wikimedia.org/T256395)
[08:20:59] (PS2) Jbond: idp: enable ipv6 for IDP test roles [puppet] - https://gerrit.wikimedia.org/r/612605
[08:21:01] (PS1) Jbond: idp: enable ipv6 for IDP role [puppet] - https://gerrit.wikimedia.org/r/612824
[08:21:13] (CR) Giuseppe Lavagetto: [C: +1] Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) (owner: JMeybohm)
[08:21:18] moritzm: I like the second one! :D
[08:21:55] (CR) Jbond: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/612605 (owner: Jbond)
[08:22:20] moritzm: the second half of the english phrase is "and fools seldom differ" :)
[08:22:56] kormat: the second part is censored in the American version
[08:23:30] kormat: thanks, I had a hunch this was in use elsewhere as well :-)
[08:23:47] ema: TIL :)
[08:24:07] (CR) Giuseppe Lavagetto: [C: -1] Check if images are debian based before generating report (1 comment) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) (owner: JMeybohm)
[08:24:52] kormat: https://en.wiktionary.org/wiki/great_minds_think_alike#Usage_notes :)
[08:25:48] lol thought you were joking ema
[08:26:52] that happens strangely often
[08:28:15] hehe :D
[08:28:33] (CR) Ema: [C: +2] ATS: local debugging rule for traffic-cache-atstext [puppet] - https://gerrit.wikimedia.org/r/612823 (https://phabricator.wikimedia.org/T256395) (owner: Ema)
[08:28:40] Operations, Puppet, Patch-For-Review, User-Joe, User-jbond: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (jbond) https://phabricator.wikimedia.org/T256972 is a plan to refactor the mysql classes which should remove the mysql parts...
[08:30:07] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:30:17] (PS2) Jbond: role: install fcgid package on netmon [puppet] - https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:30:45] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:31:11] (CR) jerkins-bot: [V: -1] role: install fcgid package on netmon [puppet] - https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:34:26] (CR) Ayounsi: [C: +1] "PCC is currently a NOOP https://puppet-compiler.wmflabs.org/compiler1003/23894/" [puppet] - https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: Filippo Giunchedi)
[08:36:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:41:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:45:34] (PS2) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[08:45:36] (PS1) Jbond: mx2001: disable ldap validation [puppet] - https://gerrit.wikimedia.org/r/612826
[08:46:30] (PS1) Marostegui: Revert "db1120: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/612709
[08:46:35] (CR) Jbond: role::exim: update config to drop ldap validation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[08:47:07] (CR) Marostegui: [C: +2] Revert "db1120: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/612709 (owner: Marostegui)
[08:49:39] (CR) Alexandros Kosiaris: [C: +2] eventgate: Switch stream_config_url to https [deployment-charts] - https://gerrit.wikimedia.org/r/612594 (https://phabricator.wikimedia.org/T257887) (owner: Alexandros Kosiaris)
[08:50:13] (PS1) Jbond: jumpcloud: remove refrenses to jumpcloud [puppet] - https://gerrit.wikimedia.org/r/612827
[08:50:15] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: [mwv] Use NFSv4 by default LXC+Vagrant [puppet] - https://gerrit.wikimedia.org/r/612682 (https://phabricator.wikimedia.org/T257855) (owner: BryanDavis)
[08:50:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11912 and previous config saved to /var/cache/conftool/dbconfig/20200715-085032-marostegui.json
[08:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:51] Operations, DBA, User-Kormat: Add monitoring to ensure consistency between puppet and zarcillo - https://phabricator.wikimedia.org/T257821 (Kormat)
[08:50:53] (Merged) jenkins-bot: eventgate: Switch stream_config_url to https [deployment-charts] - https://gerrit.wikimedia.org/r/612594 (https://phabricator.wikimedia.org/T257887) (owner: Alexandros Kosiaris)
[08:51:02] Operations, DBA, User-Kormat: Add monitoring to ensure consistency between tendril and zarcillo - https://phabricator.wikimedia.org/T257822 (Kormat)
[08:51:12] Operations, DBA, User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (Kormat)
[08:51:34] (CR) Jbond: [C: +2] jumpcloud: remove refrenses to jumpcloud [puppet] - https://gerrit.wikimedia.org/r/612827 (owner: Jbond)
[08:56:00] (PS3) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[08:56:40] (CR) Alexandros Kosiaris: [C: -1] "Nice! It would be great to have IPv6 support in this component as well. Comments inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/612603 (owner: Jbond)
[08:56:49] (PS2) Jbond: mx2001: disable ldap validation [puppet] - https://gerrit.wikimedia.org/r/612826
[09:00:08] (PS1) Jbond: cloud.yaml: add missing key [puppet] - https://gerrit.wikimedia.org/r/612829
[09:00:51] (CR) Jbond: [C: +2] cloud.yaml: add missing key [puppet] - https://gerrit.wikimedia.org/r/612829 (owner: Jbond)
[09:01:03] (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/612273 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[09:02:31] (CR) Muehlenhoff: [C: +1] idp: enable ipv6 for IDP role [puppet] - https://gerrit.wikimedia.org/r/612824 (owner: Jbond)
[09:02:45] (CR) Muehlenhoff: [C: +1] idp: enable ipv6 for IDP test roles [puppet] - https://gerrit.wikimedia.org/r/612605 (owner: Jbond)
[09:04:04] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:04:14] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:04:17] Operations, PDF-Rendering, Product-Infrastructure-Team-Backlog, Proton, and 2 others: PDF renderer needs better CJK font - https://phabricator.wikimedia.org/T226633 (Aklapper) @Shizhao: Out of scope for #User-notice.
[09:04:22] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:04:30] (CR) JMeybohm: [C: +2] Drop support for python3.5 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610871 (owner: JMeybohm)
[09:04:31] sigh
[09:04:33] (CR) JMeybohm: [C: +2] Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) (owner: JMeybohm)
[09:04:34] checking aqs
[09:04:40] (CR) JMeybohm: [C: +2] New package version [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610886 (owner: JMeybohm)
[09:04:43] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[09:04:43] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P11913 and previous config saved to /var/cache/conftool/dbconfig/20200715-090545-marostegui.json
[09:05:46] (Merged) jenkins-bot: Drop support for python3.5 [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610871 (owner: JMeybohm)
[09:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:07] (Merged) jenkins-bot: Add basic chartmuseum library and helm-chartctl CLI [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610849 (https://phabricator.wikimedia.org/T257333) (owner: JMeybohm)
[09:06:11] (Merged) jenkins-bot: New package version [docker-images/docker-report] - https://gerrit.wikimedia.org/r/610886 (owner: JMeybohm)
[09:06:20] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[09:06:21] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[09:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:45] !log deploy eventgate-analytics in staging, eqiad, codfw for switching to using discovery records and HTTPS for talking to the API [09:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:00] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:07:04] !log Correction: deploy eventgate-analytics-external in staging, eqiad, codfw for switching to using discovery records and HTTPS for talking to the API [09:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:16] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:07:24] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:08:47] the aqs alerts are due to data being dropped; in theory we thought we had fixed the issue, but in practice it still reappears sometimes [09:09:05] Druid will be upgraded soon to hopefully get fewer weird cases like this one [09:09:16] (even if we know more or less what is happening) [09:09:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Talk to API over HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/612600 (https://phabricator.wikimedia.org/T257887) (owner: 10Alexandros Kosiaris) [09:10:54] (03Merged) 10jenkins-bot: mobileapps: Talk to API over HTTPS [deployment-charts] - 10https://gerrit.wikimedia.org/r/612600 (https://phabricator.wikimedia.org/T257887) (owner: 10Alexandros Kosiaris) [09:10:55] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[09:10:55] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [09:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:39] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) > we've not had any issues anymore I honestly don't trust tendril, we said many times "Issues seems now fixed/mitigated" and they end up coming back. I think for performance reasons we... [09:16:53] (03PS3) 10Jbond: envoyproxy: add ability to also listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/612603 [09:17:05] (03CR) 10Jbond: "updated thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/612603 (owner: 10Jbond) [09:19:14] !log deploy mobileapps in kubernetes to talk HTTPS to the mw API [09:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:59] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [09:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:00] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [09:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:31] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[09:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:07] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) Alternatively, we can request virtual machines on production for both dc instances and that way we can easily separate both services so they don't interact, until tendril goes away. [09:25:24] (03CR) 10Ema: [C: 03+2] ATS: send 'SSL connection failed' errors to logstash [puppet] - 10https://gerrit.wikimedia.org/r/612282 (https://phabricator.wikimedia.org/T257840) (owner: 10Ema) [09:26:07] 10Operations, 10Traffic, 10Patch-For-Review: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 (10ema) [09:26:19] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [09:26:31] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [09:27:38] (03CR) 10JMeybohm: Check if images are debian based before generating report (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [09:30:04] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10akosiaris) [09:30:29] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10akosiaris) [09:33:06] (03PS1) 10Muehlenhoff: Add IDP service definitions for Yarn/Superset/Turnilo [puppet] - 
10https://gerrit.wikimedia.org/r/612833 [09:38:14] (03CR) 10Elukey: [C: 03+1] "From an ignorant point of view, it looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/612833 (owner: 10Muehlenhoff) [09:38:56] (03PS1) 10Elukey: profile::piwik::webserver: avoid port 80 for HTTP [puppet] - 10https://gerrit.wikimedia.org/r/612834 [09:40:30] moritzm: --^ [09:40:33] does it make sense? [09:40:42] (to solve the outstanding icinga alert) [09:41:21] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) >>! In T257816#6307674, @jcrespo wrote: >> we've not had any issues anymore > > I honestly don't trust tendril, we said many times "Issues seems now fixed/mitigated" and they end up... [09:42:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11914 and previous config saved to /var/cache/conftool/dbconfig/20200715-094226-marostegui.json [09:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:30] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) > We cannot be 100% sure that wherever we host zarcillo will always be up, especially if shared with more stuff. Hence see my last comment. [09:44:43] (03PS1) 10Ema: purged: stop passing -mcast_addrs [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) [09:45:01] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) That also means introducing even more infra - which would also be different from the rest (VMs) - why not trying to make the retrying process a bit easier or auto-healing (maybe even... 
[09:45:42] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [09:47:59] !log Deploy schema change on s8 codfw master, lag will appear on codfw T256685 [09:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:04] T256685: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 [09:49:59] (03PS2) 10Ema: purged: stop passing -mcast_addrs [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) [09:50:03] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (10ema) When it comes to purged, I just confirmed with `rate(purged_htcp_packets_total[5m]) > 0` that indeed it stopped receiving multicast HTCP purges as mentioned by @Pchelolo on 2020-07-0... [09:50:35] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [09:53:24] (03CR) 10Ema: "pcc here, lgtm: https://puppet-compiler.wmflabs.org/compiler1003/492/" [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [09:53:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoyproxy: add ability to also listen on IPv6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/612603 (owner: 10Jbond) [09:57:46] (03CR) 10Jbond: envoyproxy: add ability to also listen on IPv6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/612603 (owner: 10Jbond) [09:57:54] (03CR) 10Jbond: [C: 03+2] envoyproxy: add ability to also listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/612603 (owner: 10Jbond) [09:58:01] (03PS3) 10Jbond: idp: enable ipv6 for IDP test roles [puppet] - 10https://gerrit.wikimedia.org/r/612605 [09:58:12] (03PS2) 10Jbond: idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 
[09:58:42] (03CR) 10Jbond: [C: 03+2] idp: enable ipv6 for IDP test roles [puppet] - 10https://gerrit.wikimedia.org/r/612605 (owner: 10Jbond) [09:58:48] (03CR) 10Elukey: [C: 03+2] profile::piwik::webserver: avoid port 80 for HTTP [puppet] - 10https://gerrit.wikimedia.org/r/612834 (owner: 10Elukey) [09:59:10] elukey: you happy for me to merge [09:59:17] +1 [09:59:18] thanks! [09:59:55] np merged [10:00:09] now if I break piwik it is your fault! [10:00:14] lol :D [10:05:30] (03PS1) 10Ema: Make -mcast_addrs optional [software/purged] - 10https://gerrit.wikimedia.org/r/612840 (https://phabricator.wikimedia.org/T257573) [10:06:12] (03PS3) 10Ema: purged: stop passing -mcast_addrs [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) [10:07:48] !log imported docker-report_0.0.5-1 to buster-wikimedia [10:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11915 and previous config saved to /var/cache/conftool/dbconfig/20200715-100855-marostegui.json [10:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:53] (03CR) 10DCausse: [C: 03+1] add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [10:19:11] (03PS3) 10Alexandros Kosiaris: mobileapps: Add a temporary non-TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/612273 (https://phabricator.wikimedia.org/T218733) [10:19:41] (03PS1) 10Jbond: cloud.yaml: add missing default profile::java::java_packages: [] [puppet] - 10https://gerrit.wikimedia.org/r/612842 [10:20:20] (03CR) 10Jbond: [C: 03+2] cloud.yaml: add missing default profile::java::java_packages: [] [puppet] - 10https://gerrit.wikimedia.org/r/612842 (owner: 10Jbond) [10:20:29] !log updating python3-docker-report to 
0.0.5-1 on deneb [10:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:47] (03PS4) 10Alexandros Kosiaris: mobileapps: Add a temporary non-TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/612273 (https://phabricator.wikimedia.org/T218733) [10:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1120 after reimage', diff saved to https://phabricator.wikimedia.org/P11916 and previous config saved to /var/cache/conftool/dbconfig/20200715-102605-marostegui.json [10:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Add a temporary non-TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/612273 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [10:28:30] (03Merged) 10jenkins-bot: mobileapps: Add a temporary non-TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/612273 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [10:30:16] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [10:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:29] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [10:30:29] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [10:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:27] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[10:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:45] !log disable ping offload in eqiad [10:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:59] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) > let's tackle that I would say I am sorry that hosting the backup logs database was such overhead, I honestly thought it was much less resource intensive for DBAs. I will ask for reso... [10:41:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/612567 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [10:43:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:49] !log re-enable ping offload in eqiad [10:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:48] !log disable ping offload in codfw [10:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:46:41] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Marostegui) I have never said it is an overhead and you know very well it is not resource intensive - my point is: let's try not have more special cases and let's try to have things as consisten... 
[10:50:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:52:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:56] !log re-enable ping offload in codfw [10:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:00] !log disable ping offload in esams [10:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Addshore: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T1100). [11:00:05] Addshore: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:16] o/ [11:01:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:24] (03PS4) 10Ema: ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) [11:02:26] (03PS4) 10Ema: varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) [11:02:28] (03PS4) 10Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) [11:02:30] (03PS4) 10Ema: LVS: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) [11:02:32] (03PS4) 10Ema: icinga: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) [11:02:34] (03PS4) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) [11:02:36] (03PS4) 10Ema: ATS: stop responding to varnishcheck/status [puppet] - 10https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) [11:03:27] (03CR) 10Addshore: [C: 03+2] Wikibase: Split localEntitySourceName config for repo and client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612666 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:03:28] 10Operations: Move database backups metadata to m1 - https://phabricator.wikimedia.org/T258045 (10jcrespo) [11:04:27] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) I've created T258045 for the backups database. You can freely decide about zarcillo now. 
[11:04:54] 10Operations: Move database backups metadata to m1 - https://phabricator.wikimedia.org/T258045 (10jcrespo) p:05Triage→03Medium a:03jcrespo [11:05:04] (03Merged) 10jenkins-bot: Wikibase: Split localEntitySourceName config for repo and client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612666 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:05:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:33] !log re-enable ping offload in esams [11:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:53] (03CR) 10Addshore: [C: 03+2] Wikibase labs: All client "local" entity sources are wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612667 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:08:43] (03Merged) 10jenkins-bot: Wikibase labs: All client "local" entity sources are wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612667 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:08:53] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:612666]] Wikibase: Split localEntitySourceName config for repo and client T254315 (duration: 01m 16s) [11:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:58] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:11:57] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: LABS [[gerrit:612667]] Wikibase labs: All client "local" entity sources are wikidata T254315 (duration: 01m 04s) [11:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:28] (03CR) 10Addshore: [C: 03+2] Wikidata test: Split client db lists. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/612669 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:14:19] (03Merged) 10jenkins-bot: Wikidata test: Split client db lists. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612669 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:15:45] (03PS1) 10Ayounsi: Revert "esams: set prepending" [homer/public] - 10https://gerrit.wikimedia.org/r/612715 [11:16:09] !log remove as-path prepending in esams [11:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:19] (03CR) 10Ayounsi: [C: 03+2] Revert "esams: set prepending" [homer/public] - 10https://gerrit.wikimedia.org/r/612715 (owner: 10Ayounsi) [11:16:43] (03Merged) 10jenkins-bot: Revert "esams: set prepending" [homer/public] - 10https://gerrit.wikimedia.org/r/612715 (owner: 10Ayounsi) [11:26:19] !log addshore@deploy1001 Synchronized dblists/wikidataclient.dblist: T254315 [[gerrit:612669]] Wikidata test: Split client db lists. PT1/2 (duration: 01m 05s) [11:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:25] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:27:49] !log addshore@deploy1001 Synchronized wmf-config: T254315 [[gerrit:612669]] Wikidata test: Split client db lists. PT2/2 (duration: 01m 06s) [11:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:08] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:28:14] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10ayounsi) There are also some diffscan changes, to be checked once everything is done. 
[11:28:36] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:28:43] (03CR) 10Addshore: [C: 03+2] Wikibase test: Client local entity sources are always testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612668 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:29:29] (03Merged) 10jenkins-bot: Wikibase test: Client local entity sources are always testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612668 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [11:31:50] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:36:08] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:612668]] Wikibase test: Client local entity sources are always testwikidata T254315 (duration: 01m 05s) [11:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:13] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:37:09] (03PS5) 10Ema: ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) [11:37:11] (03PS5) 10Ema: varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) [11:37:13] (03PS5) 10Ema: varnish: handle checks at healthcheck.wm.org/varnish-fe too [puppet] - 10https://gerrit.wikimedia.org/r/610046 (https://phabricator.wikimedia.org/T255015) [11:37:15] (03PS5) 10Ema: LVS: update Varnish healthcheck endpoint [puppet] - 
10https://gerrit.wikimedia.org/r/610047 (https://phabricator.wikimedia.org/T255015) [11:37:17] (03PS5) 10Ema: icinga: update Varnish healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/610048 (https://phabricator.wikimedia.org/T255015) [11:37:19] (03PS5) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) [11:37:21] (03PS5) 10Ema: ATS: stop responding to varnishcheck/status [puppet] - 10https://gerrit.wikimedia.org/r/610052 (https://phabricator.wikimedia.org/T255015) [11:37:48] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:41:19] (03PS5) 10DCausse: [wdqs] drop updater mode config [puppet] - 10https://gerrit.wikimedia.org/r/602353 [11:41:21] (03PS24) 10DCausse: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 [11:49:41] (03CR) 10Addshore: [C: 03+2] Commons: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609987 (https://phabricator.wikimedia.org/T256906) (owner: 10Addshore) [11:50:30] (03Merged) 10jenkins-bot: Commons: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609987 (https://phabricator.wikimedia.org/T256906) (owner: 10Addshore) [11:57:46] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:609987]] Commons: Define entity sources configuration (take 2) T254315 (duration: 01m 03s) [11:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:52] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [11:59:44] !log deploy window closed / done :) [11:59:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T1200) [12:09:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:10:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:50] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Jclark-ctr) @marostegui Yes all items finished sorry for not commenting. Dell did not come till very late yesterday [12:21:44] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) Thanks @Jclark-ctr - just for the record in case this host has future issues, was the mainboard and DIMM modules replaced as well as the hard disk? 
[12:24:19] (03PS1) 10Jcrespo: database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) [12:25:31] (03CR) 10jerkins-bot: [V: 04-1] database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [12:27:10] (03PS2) 10Jcrespo: database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) [12:28:26] (03CR) 10jerkins-bot: [V: 04-1] database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [12:29:20] (03PS3) 10Jcrespo: database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) [12:34:28] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:34:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, I'm far from an Exim expert, but per my understanding it should be correct (FWIW). This covers all the non-mailman se" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:35:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, and far more elegant than synching with Jumpcloud :-)" [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [12:36:18] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:37:39] (03PS1) 10Jcrespo: mariadb-backups: Add password to the cloud private puppet repo [labs/private] - 10https://gerrit.wikimedia.org/r/612857 (https://phabricator.wikimedia.org/T258045) [12:38:31] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10Marostegui) The RAID looks good ` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-... [12:45:18] (03PS1) 10Jbond: exim: update phabricator redirects to use CNAME [puppet] - 10https://gerrit.wikimedia.org/r/612860 [12:46:28] (03CR) 10Jbond: role::exim: update config to drop ldap validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:50:28] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Add password to the cloud private puppet repo [labs/private] - 10https://gerrit.wikimedia.org/r/612857 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [12:50:33] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb-backups: Add password to the cloud private puppet repo [labs/private] - 10https://gerrit.wikimedia.org/r/612857 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [12:51:07] (03PS4) 10Jcrespo: database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) [12:51:32] (03CR) 10Muehlenhoff: exim: update phabricator redirects to use CNAME (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/612860 (owner: 10Jbond) [12:53:31] (03PS2) 10Jbond: exim: update phabricator redirects to use CNAME [puppet] - 10https://gerrit.wikimedia.org/r/612860 [12:53:44] (03CR) 10Jbond: exim: update phabricator redirects to use CNAME (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/612860 (owner: 10Jbond) [12:54:03] (03CR) 10Muehlenhoff: [C: 03+2] Improve error handling if malformed host is given [cookbooks] - 10https://gerrit.wikimedia.org/r/612591 (owner: 10Muehlenhoff) [12:55:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/612860 (owner: 10Jbond) [13:00:04] James_F and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T1300). [13:00:26] (03PS1) 10Jforrester: group1 wikis to 1.35.0-wmf.41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612861 [13:00:28] (03CR) 10Jforrester: [C: 03+2] group1 wikis to 1.35.0-wmf.41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612861 (owner: 10Jforrester) [13:01:17] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.41 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612861 (owner: 10Jforrester) [13:03:09] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.41 [13:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:15] !log jforrester@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.41 (duration: 01m 05s) [13:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:51] (03CR) 10Muehlenhoff: [C: 03+2] role: port netmon to Buster [puppet] - 10https://gerrit.wikimedia.org/r/611317 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:07:57] jynus: shall I merge "mariadb-backups: Add password to the cloud private puppet repo" along? 
[13:08:09] sorry, yes [13:08:15] ack [13:08:23] I forgot I had to merge cloud repo on puppetmaster now [13:09:29] merged [13:13:15] (03PS3) 10Muehlenhoff: role: install fcgid package on netmon [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:14:06] (03CR) 10jerkins-bot: [V: 04-1] role: install fcgid package on netmon [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:22:41] (03PS1) 10Alexandros Kosiaris: mobileapps: Add LVS IPs on kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/612863 [13:25:59] (03PS1) 10Elukey: profile::mariadb::misc::analytics::multiinstance: move meta to 3306 [puppet] - 10https://gerrit.wikimedia.org/r/612864 (https://phabricator.wikimedia.org/T234826) [13:26:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at https://puppet-compiler.wmflabs.org/compiler1003/23898/ says ok" [puppet] - 10https://gerrit.wikimedia.org/r/612863 (owner: 10Alexandros Kosiaris) [13:32:08] (03PS1) 10Muehlenhoff: Allow installing additional libapache2-mod* packages in the httpd class [puppet] - 10https://gerrit.wikimedia.org/r/612865 [13:32:51] (03CR) 10Muehlenhoff: role: install fcgid package on netmon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/611318 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:34:57] (03CR) 10CDanis: [C: 03+1] varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:35:08] (03PS5) 10Jcrespo: database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) [13:36:19] (03CR) 10Jcrespo: "I will fix configuring the stats on a separate file (to avoid the redundancy) on a followup commit."
[puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [13:36:52] (03CR) 10Jcrespo: [C: 03+2] database-backups: Move database-backup metadata tables from zarcillo to m1 [puppet] - 10https://gerrit.wikimedia.org/r/612856 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [13:37:51] (03CR) 10Ema: [C: 03+2] ATS: handle backend checks at healthcheck.wm.org/ats-be too [puppet] - 10https://gerrit.wikimedia.org/r/610041 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [13:40:43] (03PS1) 10Jcrespo: Revert "database-backups: Move database-backup metadata tables from zarcillo to m1" [puppet] - 10https://gerrit.wikimedia.org/r/612720 [13:41:54] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "database-backups: Move database-backup metadata tables from zarcillo to m1" [puppet] - 10https://gerrit.wikimedia.org/r/612720 (owner: 10Jcrespo) [13:43:06] (03PS1) 10Jcrespo: Revert "Revert "database-backups: Move database-backup metadata tables from zarcillo to m1"" [puppet] - 10https://gerrit.wikimedia.org/r/612721 [13:45:01] (03PS2) 10Jcrespo: Revert "Revert "database-backups: Move database-backup metadata tables from zarcillo to m1"" [puppet] - 10https://gerrit.wikimedia.org/r/612721 [13:45:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Make -mcast_addrs optional [software/purged] - 10https://gerrit.wikimedia.org/r/612840 (https://phabricator.wikimedia.org/T257573) (owner: 10Ema) [13:47:23] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "database-backups: Move database-backup metadata tables from zarcillo to m1"" [puppet] - 10https://gerrit.wikimedia.org/r/612721 (owner: 10Jcrespo) [13:49:14] (03CR) 10Giuseppe Lavagetto: "LGTM, but I'd just remove the option completely. You can always revert this patch in case, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [13:49:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] purged: stop passing -mcast_addrs [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [13:50:34] 10Operations, 10DBA, 10OTRS, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10jcrespo) a:05jcrespo→03akosiaris A clone of the otrs database has been setup on db1077. The question now, @akosiaris, is what needs acce... [13:50:46] (03PS1) 10Jforrester: Add temporary fix to ensure array is passed to array_map() [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612722 (https://phabricator.wikimedia.org/T258056) [13:50:59] (03CR) 10Jforrester: [C: 03+2] Add temporary fix to ensure array is passed to array_map() [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612722 (https://phabricator.wikimedia.org/T258056) (owner: 10Jforrester) [13:51:01] (03PS1) 10Andrew Bogott: update nova observer password for codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/612868 [13:51:31] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=kubernetes.* [13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:13] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] update nova observer password for codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/612868 (owner: 10Andrew Bogott) [13:53:25] !log akosiaris@cumin1001 conftool action : set/weight=264; selector: dc=codfw,service=mobileapps,name=scb.* [13:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=mobileapps,name=kubernetes.* [13:53:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] (03Merged) 10jenkins-bot: Add temporary fix to ensure array is passed to array_map() [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612722 (https://phabricator.wikimedia.org/T258056) (owner: 10Jforrester) [13:54:49] !log pool kubernetes nodes for mobileapps in codfw [13:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:10] (03CR) 10RLazarus: [C: 03+2] Don't build and run the tests when compiling Envoy, just the binary. [debs/envoyproxy] - 10https://gerrit.wikimedia.org/r/612675 (https://phabricator.wikimedia.org/T256843) (owner: 10RLazarus) [13:56:11] (03CR) 10Muehlenhoff: [C: 03+2] Add IDP service definitions for Yarn/Superset/Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/612833 (owner: 10Muehlenhoff) [13:57:56] (03CR) 10CDanis: [C: 03+2] "I'm not super thrilled with this, but it does seem like the only reasonable option." [puppet] - 10https://gerrit.wikimedia.org/r/612399 (owner: 10CDanis) [13:58:53] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/UrlShortener/includes/UrlShortenerUtils.php: T258056 Add temporary fix to ensure array is passed to array_map() (duration: 01m 08s) [13:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:59] T258056: PHP Warning: array_map(): Argument #2 should be an array - https://phabricator.wikimedia.org/T258056 [13:59:25] (03PS4) 10Ema: purged: stop passing -mcast_addrs [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) [13:59:42] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [14:00:07] (03CR) 10Chico Venancio: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612629 (https://phabricator.wikimedia.org/T257925) (owner: 10Chico Venancio) [14:00:49] 
(03Abandoned) 10CDanis: vcl: public_clouds_shutdown: ratelimit API reqs as well [puppet] - 10https://gerrit.wikimedia.org/r/609480 (owner: 10CDanis) [14:03:07] (03PS1) 10Jdrewniak: Disable affinity quicksurveys for the following wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612870 (https://phabricator.wikimedia.org/T246977) [14:06:40] (03PS1) 10Jcrespo: mariadb: Fix typo when calling check_mariadb_backups.py [puppet] - 10https://gerrit.wikimedia.org/r/612871 (https://phabricator.wikimedia.org/T258045) [14:07:18] (03PS2) 10Jcrespo: mariadb-backups: Fix typo when calling check_mariadb_backups.py [puppet] - 10https://gerrit.wikimedia.org/r/612871 (https://phabricator.wikimedia.org/T258045) [14:07:49] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb-backups: Fix typo when calling check_mariadb_backups.py [puppet] - 10https://gerrit.wikimedia.org/r/612871 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [14:10:41] !log akosiaris@cumin1001 conftool action : set/weight=132; selector: dc=codfw,service=mobileapps,name=scb.* [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:50] (03PS1) 10Urbanecm: Create archiver group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) [14:12:00] !log increase codfw mobileapps kubernetes traffic to 2% T218733 [14:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:05] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [14:12:39] * mdholloway holds his breath [14:12:59] (03CR) 10Ema: [C: 03+2] Make -mcast_addrs optional [software/purged] - 10https://gerrit.wikimedia.org/r/612840 (https://phabricator.wikimedia.org/T257573) (owner: 10Ema) [14:13:53] mdholloway: Big moment! 
[14:14:57] (03PS1) 10Ema: Release version 0.17 [software/purged] - 10https://gerrit.wikimedia.org/r/612874 [14:17:42] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:18:02] (03CR) 10Ema: [C: 03+2] Release version 0.17 [software/purged] - 10https://gerrit.wikimedia.org/r/612874 (owner: 10Ema) [14:18:04] (03PS1) 10Jforrester: Add VisualEditor support back to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612875 (https://phabricator.wikimedia.org/T241961) [14:19:08] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1803 bytes in 0.084 second response time https://phabricator.wikimedia.org/project/view/71/ [14:21:34] (03PS2) 10Jforrester: Add VisualEditor support back to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612875 (https://phabricator.wikimedia.org/T241961) [14:23:17] (03CR) 10Jforrester: [C: 03+2] "Let's see if this works." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/612875 (https://phabricator.wikimedia.org/T241961) (owner: 10Jforrester) [14:24:07] (03Merged) 10jenkins-bot: Add VisualEditor support back to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612875 (https://phabricator.wikimedia.org/T241961) (owner: 10Jforrester) [14:25:00] !log repooling wdqs1006 - catched up on lag [14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Add exceptional wikitech VE/Parsoid config T241961 (duration: 01m 05s) [14:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:35] T241961: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 [14:27:14] 10Operations: Move database backups metadata to m1 - https://phabricator.wikimedia.org/T258045 (10jcrespo) 05Open→03Resolved Done, backups now send metadata to m1 master (no proxy is used for now to allow tls). New dbbackups database also has been added to m1 backups and documented on wikitech. [14:28:34] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add exceptional wikitech VE/Parsoid config T241961 (duration: 01m 04s) [14:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:54] 10Operations, 10DBA, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10jcrespo) So it turns out that work on T257816 unveiled that there were a lot of hardcoded endpoints that made that task, not only an option, but a requirement to acheive this one. More work will... 
[14:30:21] !log upload purged 0.17 to buster-wikimedia T257573 [14:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:25] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573 [14:34:40] (03PS1) 10Lucas Werkmeister (WMDE): Stop checking if WikibaseLib is loaded [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612723 (https://phabricator.wikimedia.org/T258062) [14:34:43] (03PS1) 10Andrew Bogott: codfw1dev: fix 'puppet' dns hack to point to cloud-internal puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/612877 (https://phabricator.wikimedia.org/T242607) [14:34:54] o/ Just to let you know we've (the Wikidata team) identified that rolling the train forwards has broken change dispatching. We're tracking it in https://phabricator.wikimedia.org/T258062 right now we are writing a fix [14:35:07] see just above [14:35:21] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: fix 'puppet' dns hack to point to cloud-internal puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/612877 (https://phabricator.wikimedia.org/T242607) (owner: 10Andrew Bogott) [14:35:26] is it something that can be replayed at a later time? [14:35:31] yup [14:35:42] ok then [14:35:52] (03CR) 10Addshore: [C: 03+1] "Looks good to me" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612723 (https://phabricator.wikimedia.org/T258062) (owner: 10Lucas Werkmeister (WMDE)) [14:36:05] James_F: am I okay to go ahead and backport it?
[14:36:30] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:36:54] !log A:cp: upgrade purged to 0.17 T257573 [14:37:04] (03CR) 10Tarrow: [C: 03+1] Stop checking if WikibaseLib is loaded [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612723 (https://phabricator.wikimedia.org/T258062) (owner: 10Lucas Werkmeister (WMDE)) [14:37:06] (03CR) 10Addshore: [C: 03+2] Stop checking if WikibaseLib is loaded [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612723 (https://phabricator.wikimedia.org/T258062) (owner: 10Lucas Werkmeister (WMDE)) [14:37:11] addshore: Go for it. [14:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:19] T257573: Remove multicast - https://phabricator.wikimedia.org/T257573 [14:37:20] addshore: Sorry, was just verifying that VE now works on wikitech again. [14:37:29] James_F: cool! ty, *waits for CI* [14:38:25] (03CR) 10Ema: [C: 03+2] purged: stop passing -mcast_addrs [puppet] - 10https://gerrit.wikimedia.org/r/612836 (https://phabricator.wikimedia.org/T250781) (owner: 10Ema) [14:41:01] addshore, James_F: won't it need backporting to REL1_35 as well [14:41:05] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org, and 2 others: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10Jdforrester-WMF) 05Open→03Resolved a:03Jdforrester-WMF Tah-dah. [14:41:19] RhinosF1: I'll check :) [14:41:25] thanks for the poke ! [14:41:32] ty [14:42:56] RhinosF1: indeed, thanks! [14:43:21] Lucas_WMDE: yeah just saw someone do it [14:44:34] it was you! Ty! 
[14:48:07] 10Operations, 10DBA, 10OTRS, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10akosiaris) >>! In T257928#6308424, @jcrespo wrote: > A clone of the otrs database has been setup on db1077. The question now, @akosiaris, is... [14:48:50] (03CR) 10Addshore: [V: 03+2 C: 03+2] "CI is all going well, and I will watch the result but I want to stage this now on the maint host" [extensions/Wikibase] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/612723 (https://phabricator.wikimedia.org/T258062) (owner: 10Lucas Werkmeister (WMDE)) [14:51:21] !log pulled https://gerrit.wikimedia.org/r/612723 onto mwmaint 1002 ahead of syncing everywhere (and CI finishing) [14:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] 10Operations, 10DBA, 10OTRS, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10jcrespo) Np. More questions: do we setup it with a separate user/password (to avoid mistakes with the production db) or the same (for conven... [14:52:26] (03CR) 10Ema: [C: 03+2] varnish: send backend probes to healthcheck.wm.org/ats-be [puppet] - 10https://gerrit.wikimedia.org/r/610042 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema) [14:55:22] 10Operations, 10DBA, 10OTRS, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10akosiaris) >>! In T257928#6308662, @jcrespo wrote: > Np. More questions: do we setup it with a separate user/password (to avoid mistakes wit... 
[14:56:06] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1791 bytes in 0.112 second response time https://phabricator.wikimedia.org/project/view/71/ [14:57:23] syncing to everywhere [14:58:28] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.41/extensions/Wikibase/repo: [[gerrit:612723]] Stop checking if WikibaseLib is loaded T258062 (already on mwmaint1002) (duration: 01m 08s) [14:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:33] T258062: Wikidata Change Dispatching Broken - https://phabricator.wikimedia.org/T258062 [14:59:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Moved a comment to a better place [puppet] - 10https://gerrit.wikimedia.org/r/611455 (owner: 10Ahmon Dancy) [14:59:54] James_F: all good from our side [15:02:41] 10Operations, 10DBA, 10OTRS, 10serviceops: Create a parallel OTRS database with a frozen snapshot of the production one - https://phabricator.wikimedia.org/T257928 (10jcrespo) a:05akosiaris→03jcrespo [15:04:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 75 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:06:23] Ace. [15:08:34] (03PS1) 10Cwhite: debianization [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/612878 [15:10:02] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:17:42] (03PS1) 10C. 
Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [15:20:16] !log rebooting webperf* hosts for kernel update [15:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:59] 10Operations, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org, and 2 others: VisualEditor was removed from Wikitech because Parsoid/PHP isn't yet compatible with how Wikitech is set up - https://phabricator.wikimedia.org/T241961 (10cscott) ^ slight counter-proposal, to keep wikitech a little bit more i... [15:23:36] (03PS2) 10Cwhite: debianization [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/612878 [15:24:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:28] (03Abandoned) 10Cwhite: debianization [debs/grafana-loki] - 10https://gerrit.wikimedia.org/r/612878 (owner: 10Cwhite) [15:27:42] (03PS3) 10Cwhite: debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 [15:28:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:29:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:34:28] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:40:02] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:43:31] (03CR) 10Cwhite: "> Patch Set 2:" [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [15:49:17] (03PS1) 10Urbanecm: Add avk to the langlist helper [dns] - 10https://gerrit.wikimedia.org/r/612888 (https://phabricator.wikimedia.org/T257943) [15:49:54] (03CR) 10Muehlenhoff: "Ack, using 1.13 is perfectly fine right now, it's just that a package from backport is a bit of a moving target (while 1.11 from Buster wo" [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [15:51:06] (03CR) 10Muehlenhoff: "Context question for the review: Is this targeted towards an upload to Debian? 
(as the changelog targets unstable)" [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [15:52:31] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Maps (Kartotherian): Geoshapes service is not sending 'access-control-allow-origin' header to some requests - https://phabricator.wikimedia.org/T241644 (10MSantos) a:05MSantos→03None [16:03:19] (03CR) 10Cwhite: "> Patch Set 3:" [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [16:11:12] !log uploaded jenkins 2.235.2 to thirdparty/ci for stretch/buster T257614 [16:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:27] 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10ovasileva) [16:18:00] (03PS1) 10Jason Linehan: Enable client error logging on ca.wikipedia.org, and disable on haw.wikipedia.org. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) [16:25:52] (03Abandoned) 10Elukey: profile::mariadb::misc::analytics::multiinstance: move meta to 3306 [puppet] - 10https://gerrit.wikimedia.org/r/612864 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [16:28:20] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:33:54] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:35:15] (03PS1) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [16:36:05] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [16:37:37] (03PS2) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [17:05:28] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:05:28] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:06:04] (03PS2) 10JMeybohm: Check if images are debian based before generating report [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) [17:06:06] (03PS2) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 [17:07:09] (03CR) 10jerkins-bot: [V: 04-1] Check if images are debian based before generating report 
[docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [17:07:39] (03CR) 10jerkins-bot: [V: 04-1] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 (owner: 10JMeybohm) [17:13:34] (03PS3) 10JMeybohm: Check if images are debian based before generating report [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) [17:13:36] (03PS3) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 [17:26:35] (03CR) 10Mholloway: [C: 03+1] "Maybe preferable to keep it on hawwiki and just add cawiki, if we're unlikely to see load-related issues on wikis of this size? Otherwise " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) (owner: 10Jason Linehan) [17:32:37] !log puppetmaster - revoking cert for planet.discovery.wmnet, add planet.wikimedia.org, remove planet.svc records, remove specific and outdated hostnames (T257840) [17:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:43] T257840: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 [17:40:47] has there been any problem with the jobqueue in the last few days that T257714 could correlate with? [17:40:47] T257714: User is receiving the same Echo web notification over and over again - https://phabricator.wikimedia.org/T257714 [17:41:03] the problem might be elsewhere, I'm just fishing [17:42:08] (03PS1) 10Dzahn: ssl/planet: update cert for planet.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/612904 (https://phabricator.wikimedia.org/T257840) [17:42:49] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) >>! 
In T257871#6307224, @ayounsi wrote: > Monitoring is alerting for `ps1-c8-eqiad` > https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-c8-eqiad > And... [17:45:02] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -in planet.discovery.wmnet.crt -text | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/612904 (https://phabricator.wikimedia.org/T257840) (owner: 10Dzahn) [17:46:27] tgr: nothing known, but I also don't think anyone has looked [17:47:16] <_joe_> we had some delivery failures mw->kafka repeating last week, but I think we know how to solve them [17:48:08] 10Operations, 10Traffic, 10Patch-For-Review: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 (10Dzahn) p:05Low→03Medium [17:51:28] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) [17:51:39] (03CR) 10Dzahn: [C: 03+2] "thanks for adding compiler results, looks good. toolforge-only, not used in prod." [puppet] - 10https://gerrit.wikimedia.org/r/611457 (https://phabricator.wikimedia.org/T250157) (owner: 10Ahmon Dancy) [17:52:10] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) Added: [] - update puppet repo file: modules/facilities/manifests/init.pp to add the senty4 line to the PDU entry [] - ensure all errors clear in icinga after work... [17:55:05] 10Operations, 10Puppet, 10Diffusion, 10Phabricator: Diffusion (Phabricator) operations-puppet repo synchronization error - https://phabricator.wikimedia.org/T257895 (10mmodell) Some of the files in the OPUP.git/objects directory were owned by root. I fixed that and it should fix the replication error. 
[17:57:25] (03PS2) 10Dzahn: Add avk to the langlist helper [dns] - 10https://gerrit.wikimedia.org/r/612888 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [17:58:56] (03CR) 10Dzahn: [C: 03+2] "approved by langcom - https://en.wikipedia.org/wiki/Kotava" [dns] - 10https://gerrit.wikimedia.org/r/612888 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [17:59:23] (03PS2) 10Jdrewniak: Disable affinity quicksurveys for the following wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612870 (https://phabricator.wikimedia.org/T246977) [17:59:59] (03PS2) 10Jason Linehan: Enable client error logging on ca.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) [18:00:04] James_F and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T1800). [18:00:04] jan_drewniak: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:28] o/ [18:00:46] !log DNS - new language 'avk' has been added - This language is called Kotava and is "a proposed international auxiliary language (IAL) that focuses especially on the principle of cultural neutrality". Learn more at https://en.wikipedia.org/wiki/Kotava [18:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:18] is anyone around to do a quick config deployment? [18:02:24] cdanis: is there an easy way to tell if the same job is being run multiple times? 
what's the most relevant log I could look at? [18:02:45] tgr: no idea, gonna have to refer you to Pchelolo I think [18:03:01] (03CR) 10Jason Linehan: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) (owner: 10Jason Linehan) [18:03:34] tgr: you can grep JobExecutor logs on mwlog [18:04:42] thx [18:07:52] 10Operations, 10Puppet, 10Diffusion, 10Phabricator: Diffusion (Phabricator) operations-puppet repo synchronization error - https://phabricator.wikimedia.org/T257895 (10mmodell) 05Open→03Resolved a:03mmodell I just checked and there are no other repositories with objects owned by root. I also see that... [18:08:02] oh is there still another deployment happening right now? [18:11:00] we're about to do a jenkins upgrade [18:12:05] (03CR) 10Herron: [C: 03+1] "Agreed!" [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [18:12:11] ah I see, no problem, I'll try the afternoon backport window then. [18:16:42] !log restarting jenkins for upgrade [18:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] 10Operations, 10Traffic: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 (10Dzahn) [18:20:11] (03CR) 10Herron: [C: 03+1] exim: update phabricator redirects to use CNAME [puppet] - 10https://gerrit.wikimedia.org/r/612860 (owner: 10Jbond) [18:20:36] 10Operations, 10Traffic: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 (10Dzahn) 05Open→03Resolved @ema Cert has been fixed. I added planet.wikimedia.org in addition to *.planet.wikimedia.org and removed "svc.eqiad/codfw" records and... 
[18:21:06] (03PS3) 10Dzahn: releases: pull MW security patches from deployment server on all servers [puppet] - 10https://gerrit.wikimedia.org/r/612445 [18:21:57] (03CR) 10Dzahn: [C: 03+1] exim: update phabricator redirects to use CNAME [puppet] - 10https://gerrit.wikimedia.org/r/612860 (owner: 10Jbond) [18:24:06] (03CR) 10Mholloway: [C: 03+1] Enable client error logging on ca.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612893 (https://phabricator.wikimedia.org/T258073) (owner: 10Jason Linehan) [18:24:45] (03PS1) 10RobH: updating ps1-c8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/612928 (https://phabricator.wikimedia.org/T257871) [18:26:02] (03CR) 10RobH: [C: 03+2] updating ps1-c8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/612928 (https://phabricator.wikimedia.org/T257871) (owner: 10RobH) [18:26:12] (03PS1) 10Dzahn: exim: remove RT redirects [puppet] - 10https://gerrit.wikimedia.org/r/612929 [18:27:51] (03CR) 10jerkins-bot: [V: 04-1] updating ps1-c8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/612928 (https://phabricator.wikimedia.org/T257871) (owner: 10RobH) [18:35:17] (03PS15) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [18:35:37] (03PS6) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [18:37:34] (03CR) 10Dzahn: [C: 04-1] "concat(): Requires array to work with (" [puppet] - 10https://gerrit.wikimedia.org/r/612445 (owner: 10Dzahn) [18:38:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:39:11] 10Operations, 10ops-codfw: (Need by: TBD) 
codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) @akosiaris no need to reopen the task since this needs to be done by the service owner on another task and not on the racking/setup task. Once the server is in s... [18:40:44] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Dzahn) It should continue on T247441 [18:42:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:57] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Dzahn) [18:43:00] 10Operations, 10serviceops, 10Patch-For-Review: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Dzahn) [18:43:06] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Dzahn) [18:43:08] 10Operations, 10serviceops, 10Patch-For-Review: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Dzahn) [18:43:20] (03PS7) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [18:43:33] mutante: is there CI for other debs/* repos, or should I just enable V+2 for now? 
[18:46:41] (03PS2) 10RobH: updating ps1-c8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/612928 (https://phabricator.wikimedia.org/T257871) [18:47:29] (03CR) 10RobH: [C: 03+2] updating ps1-c8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/612928 (https://phabricator.wikimedia.org/T257871) (owner: 10RobH) [18:47:43] Krinkle: i see "debian-glue-non-voting" on some of them [18:49:06] Krinkle: https://phabricator.wikimedia.org/rCICF1db72e0f9618b0672fd7b72e8c54f77c88fddc26 [18:49:40] (03PS8) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [18:50:24] (03PS9) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) [18:50:47] (03CR) 10RobH: [C: 03+1] "I agree we don't need to have email linked in any longer. RT can become a purely static reference." [puppet] - 10https://gerrit.wikimedia.org/r/612929 (owner: 10Dzahn) [18:52:46] mutante: ok, well, for now it should work [18:52:49] RECOVERY - ps1-c8-eqiad-infeed-load-tower-A-phase-X on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-A-phase-X 256 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:49] RECOVERY - ps1-c8-eqiad-infeed-load-tower-A-phase-Y on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-A-phase-Y 231 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:49] RECOVERY - ps1-c8-eqiad-infeed-load-tower-A-phase-Z on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-A-phase-Z 291 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:49] RECOVERY - ps1-c8-eqiad-infeed-load-tower-B-phase-X on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-B-phase-X 236 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:49] 
RECOVERY - ps1-c8-eqiad-infeed-load-tower-B-phase-Y on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-B-phase-Y 211 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:50] RECOVERY - ps1-c8-eqiad-infeed-load-tower-B-phase-Z on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-B-phase-Z 230 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:56] woot [18:52:59] Krinkle: ok [18:53:13] thus my saga of homebrew on new os x version is concluded [18:53:34] (i hadnt used it since upgrading os x) [18:53:43] burn it all, start over. [18:54:27] Krinkle: works. merged [18:54:31] robh: thanks for review [18:55:30] (03CR) 10Herron: "SGTM, please see one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/612929 (owner: 10Dzahn) [18:57:26] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) [18:57:33] wecome [18:57:36] welcome [18:57:39] 10Operations, 10serviceops, 10Patch-For-Review: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Papaul) [18:57:48] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) 05Open→03Resolved [18:57:50] 10Operations, 10serviceops, 10Patch-For-Review: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Papaul) [18:58:26] (03PS4) 10Dzahn: releases: pull MW security patches from deployment server on all servers [puppet] - 10https://gerrit.wikimedia.org/r/612445 [18:58:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10observability, 10Patch-For-Review: eqiad: PDU Upgrade in C8 (July 14, 2pm-4pm UTC)) - https://phabricator.wikimedia.org/T257871 (10RobH) 05Open→03Resolved All clear in icinga now, and added the puppet update and icinga clear 
check as steps on the template checkl... [18:58:43] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) [19:00:04] James_F and longma: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T1900). [19:00:04] (03PS2) 10Dzahn: exim: remove RT redirects [puppet] - 10https://gerrit.wikimedia.org/r/612929 [19:00:39] Train already on group1. Things are well. [19:07:11] (03CR) 10Dzahn: exim: remove RT redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/612929 (owner: 10Dzahn) [19:11:28] (03CR) 10Herron: [C: 03+1] "LGTM -- thx for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/612929 (owner: 10Dzahn) [19:14:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:16:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:15] (03PS3) 10Dzahn: exim: remove RT redirects [puppet] - 10https://gerrit.wikimedia.org/r/612929 [20:00:04] halfak and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T2000). 
[20:03:25] (03PS1) 10Andrew Bogott: Change labtestpuppetmaster2001 to role(spare::system) [puppet] - 10https://gerrit.wikimedia.org/r/612940 (https://phabricator.wikimedia.org/T258103) [20:04:20] (03CR) 10Andrew Bogott: [C: 03+2] Change labtestpuppetmaster2001 to role(spare::system) [puppet] - 10https://gerrit.wikimedia.org/r/612940 (https://phabricator.wikimedia.org/T258103) (owner: 10Andrew Bogott) [20:14:02] (03PS5) 10Dzahn: releases: pull MW security patches from deployment server on all servers [puppet] - 10https://gerrit.wikimedia.org/r/612445 [20:18:15] (03CR) 10Dzahn: [C: 03+2] "adding the string to the array works now, there is no straight-way to do it in regular puppet https://puppet-compiler.wmflabs.org/compile" [puppet] - 10https://gerrit.wikimedia.org/r/612445 (owner: 10Dzahn) [20:22:24] (03CR) 10Dzahn: meet::accountmanager: add some fake private secrets (example) (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/607153 (owner: 10Dzahn) [20:22:51] (03Abandoned) 10Dzahn: meet::accountmanager: add some fake private secrets (example) [labs/private] - 10https://gerrit.wikimedia.org/r/607153 (owner: 10Dzahn) [20:24:30] (03CR) 10Dzahn: "This created /usr/local/sbin/sync-srv-patches-releases1001.eqiad.wmnet on releases1001 and was noop on releases1002 as it should. 
this mea" [puppet] - 10https://gerrit.wikimedia.org/r/612445 (owner: 10Dzahn) [20:24:56] (03CR) 10Dzahn: "content of that file: /usr/bin/rsync -a rsync://deploy1001.eqiad.wmnet/srv-patches-releases1001.eqiad.wmnet /srv/patches"" [puppet] - 10https://gerrit.wikimedia.org/r/612445 (owner: 10Dzahn) [20:35:50] (03PS1) 10BryanDavis: toolforge: Set Strict-Transport-Security to 366 days [puppet] - 10https://gerrit.wikimedia.org/r/612947 (https://phabricator.wikimedia.org/T102367) [20:35:52] (03PS1) 10BryanDavis: toolforge: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) [20:46:10] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1002/23905/" [puppet] - 10https://gerrit.wikimedia.org/r/612948 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [20:59:34] 10Operations, 10Phabricator, 10Security-Team: HTTP 500 error trying to access phabricator from my server - https://phabricator.wikimedia.org/T257507 (10Aklapper) [21:07:16] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, and 4 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) [21:12:25] (03CR) 10Subramanya Sastry: [C: 03+1] Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) (owner: 10C. 
Scott Ananian) [21:18:24] 10Operations, 10Cloud-Services, 10Wikimedia-Mailing-lists, 10User-bd808, 10cloud-services-team (Kanban): Create cloud-admin and archive labs-admin mailing list - https://phabricator.wikimedia.org/T167155 (10bd808) [21:42:24] (03PS1) 10Bstorm: clouddb: Uncap the network for the clouddb-services project [puppet] - 10https://gerrit.wikimedia.org/r/612958 (https://phabricator.wikimedia.org/T257884) [21:51:59] (03PS1) 10Alexandros Kosiaris: mobileapps: Amend statsd exporter buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/612959 (https://phabricator.wikimedia.org/T218733) [21:56:11] (03CR) 10Bstorm: "After putting this out on toolsbeta, this appears to be safe and a functional noop https://puppet-compiler.wmflabs.org/compiler1003/23906/" [puppet] - 10https://gerrit.wikimedia.org/r/612647 (https://phabricator.wikimedia.org/T257945) (owner: 10Bstorm) [21:56:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Amend statsd exporter buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/612959 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [21:57:17] (03Merged) 10jenkins-bot: mobileapps: Amend statsd exporter buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/612959 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [21:58:26] (03PS4) 10Bstorm: cloud-nfs: Allow changing the nfs mount version [puppet] - 10https://gerrit.wikimedia.org/r/612647 (https://phabricator.wikimedia.org/T257945) [22:00:04] (03CR) 10Bstorm: cloud-nfs: Allow changing the nfs mount version (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/612647 (https://phabricator.wikimedia.org/T257945) (owner: 10Bstorm) [22:10:40] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[22:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:05] (03PS1) 10Alexandros Kosiaris: mobileapps: fix prometheus statsd exporter issue in 0.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/612966 (https://phabricator.wikimedia.org/T218733) [22:20:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: fix prometheus statsd exporter issue in 0.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/612966 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [22:21:30] (03Merged) 10jenkins-bot: mobileapps: fix prometheus statsd exporter issue in 0.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/612966 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [22:21:50] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [22:21:50] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [22:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:06] PROBLEM - DPKG on stat1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:27:46] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [22:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:12] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [22:29:12] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[22:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:18] (03CR) 10Clarakosi: "We still need to update it with our most recent envoy changes but otherwise looking good!" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [22:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:28] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [22:30:28] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [22:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:18] (03PS4) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [22:33:48] (03PS5) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [22:38:26] (03PS6) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [22:39:08] (03PS1) 10Dzahn: add IP for aphlict1001 [dns] - 10https://gerrit.wikimedia.org/r/612970 (https://phabricator.wikimedia.org/T257617) [22:43:38] (03PS1) 10Dzahn: add IP for testreduce1001 [dns] - 10https://gerrit.wikimedia.org/r/612971 (https://phabricator.wikimedia.org/T257940) [22:45:21] (03CR) 10Dzahn: [C: 03+2] add IP for aphlict1001 [dns] - 10https://gerrit.wikimedia.org/r/612970 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [22:50:55] (03CR) 10Dzahn: [C: 03+2] add IP for testreduce1001 [dns] - 10https://gerrit.wikimedia.org/r/612971 (https://phabricator.wikimedia.org/T257940) (owner: 10Dzahn) [22:52:16] 
(03CR) 10Ottomata: "I renamed the repository, so this is a new patchset." [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [22:52:29] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:52:29] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [22:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:57] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:47] (03PS1) 10BryanDavis: Add Cloud VPS global root key for Sam Reed [labs/private] - 10https://gerrit.wikimedia.org/r/612974 (https://phabricator.wikimedia.org/T249774) [22:54:49] (03PS1) 10BryanDavis: Remove Cloud VPS global root key for valhallasw [labs/private] - 10https://gerrit.wikimedia.org/r/612975 (https://phabricator.wikimedia.org/T255697) [22:54:55] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jlinehan - https://phabricator.wikimedia.org/T258119 (10jlinehan) [22:58:18] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:57] 10Operations, 10Parsoid, 10vm-requests, 10Parsoid-Tests, 10Patch-For-Review: eqiad: 1 VM request for testreduce - https://phabricator.wikimedia.org/T257940 (10Dzahn) [22:59:14] 10Operations, 10Phabricator, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM request for aphlict - https://phabricator.wikimedia.org/T257617 (10Dzahn) a:03Dzahn [23:00:02] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jlinehan - https://phabricator.wikimedia.org/T258119 (10jlinehan) This was completed back in October 2018 when I was hired (T207951) but the recent audit (T237696) seems to have removed me from the deployment group, 
due to not having depl... [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200715T2300). Please do the needful. [23:01:08] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:02:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [23:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:52] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:04:23] !log tools.admin Removed valhallasw from maintainers (T255697) [23:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:29] T255697: Offboard valhallasw as vps/toolforge admin - https://phabricator.wikimedia.org/T255697 [23:07:56] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for jlinehan - https://phabricator.wikimedia.org/T258119 (10dcipoletti) As Jason's manager, I approve [23:12:45] heh. 
this was totally the wrong channel for that !log [23:21:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [23:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:42] (03PS1) 10Dzahn: DHCP/partman: add MAC for aphlict1001 [puppet] - 10https://gerrit.wikimedia.org/r/612982 (https://phabricator.wikimedia.org/T257617) [23:31:23] (03PS1) 10Dzahn: DHCP/partman: add MAC for testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/612983 (https://phabricator.wikimedia.org/T257940) [23:56:48] (03CR) 10Dzahn: [C: 03+2] DHCP/partman: add MAC for aphlict1001 [puppet] - 10https://gerrit.wikimedia.org/r/612982 (https://phabricator.wikimedia.org/T257617) (owner: 10Dzahn) [23:57:09] (03CR) 10Dzahn: [C: 03+2] DHCP/partman: add MAC for testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/612983 (https://phabricator.wikimedia.org/T257940) (owner: 10Dzahn) [23:57:15] (03PS2) 10Dzahn: DHCP/partman: add MAC for testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/612983 (https://phabricator.wikimedia.org/T257940)