[00:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T0000).
[00:18:14] (CR) Nuria: "Note that ticket does not match changeset" [puppet] - https://gerrit.wikimedia.org/r/627878 (https://phabricator.wikimedia.org/T213741) (owner: Razzi)
[00:38:09] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Ladsgroup) >>! In T262773#6464894, @Dzahn wrote: > I can only agree with what RLazarus already said. This is just a config change at the list-admin lev...
[01:00:50] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Dzahn) >>! In T262773#6465226, @Ladsgroup wrote: > What about pywikibugs-l? I assume it's also pretty big and people can just search in phabricator ins...
[01:05:17] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Ladsgroup) >>! In T262773#6468887, @Dzahn wrote: >>>! In T262773#6465226, @Ladsgroup wrote: >> What about pywikibugs-l? I assume it's also pretty big a...
[01:11:30] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 74 probes of 570 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:14:12] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Dzahn) >>! In T262773#6468857, @Ladsgroup wrote: > it would probably have backup (which is a good thing, the current setup doesn't have backups AFAIK)...
[01:14:45] Operations, Performance-Team, Traffic, Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (Krinkle) Open→Declined
[01:15:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 59 probes of 653 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:16:32] (PS2) Krinkle: noc: Improve phrasing of highlight.php error message [mediawiki-config] - https://gerrit.wikimedia.org/r/620138 (https://phabricator.wikimedia.org/T254646)
[01:16:36] (CR) Krinkle: [C: +2] noc: Improve phrasing of highlight.php error message [mediawiki-config] - https://gerrit.wikimedia.org/r/620138 (https://phabricator.wikimedia.org/T254646) (owner: Krinkle)
[01:17:18] (Merged) jenkins-bot: noc: Improve phrasing of highlight.php error message [mediawiki-config] - https://gerrit.wikimedia.org/r/620138 (https://phabricator.wikimedia.org/T254646) (owner: Krinkle)
[01:17:50] Operations, Gerrit, Release-Engineering-Team-TODO, Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Tgr) >>! In T191183#6465408, @hashar wrote: > We can't use the third party service gravatar.com since that leaks personal information to a third party....
[01:19:05] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Dzahn) >>! In T262773#6468857, @Ladsgroup wrote: >>>Maybe we can also get an overview of the biggest archives to handle the low-hanging fruits first? I...
[01:22:40] !log krinkle@mwmaint2001 synced docroot/noc – https://gerrit.wikimedia.org/r/620138
[01:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:45] !log krinkle@mwmaint1002 synced docroot/noc – https://gerrit.wikimedia.org/r/620138
[01:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:04] Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Dzahn) Here are the largest ones as requested by Ladsgroup on T262773 ` 10G helpdesk-l 6.1G reading-web-team.mbox 6.1G arbcom-l 5.5G wmfall 4.8G reading-web-team 3.8G wikidata-bugs 3....
[01:24:17] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Dzahn) >>! In T262773#6468918, @Ladsgroup wrote: >>> There are two: `pywikibot-bugs` and `pywikipedia-bugs` 787M pywikibot-bugs 260M pywikibot-bugs.mb...
[02:01:28] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:12] RECOVERY - Check the last execution of package_builder_Clean_up_build_directory on deneb is OK: OK: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:20:34] Operations, Sentry: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138 (Tgr) Stalled→Declined Per grandparent task. We use {T226986} instead in production, which uses existing hardware.
[02:26:52] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 12 probes of 653 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:27:07] (Abandoned) Gergő Tisza: [WIP] logstash: send errors to sentry [puppet] - https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: Gergő Tisza)
[02:27:37] (Abandoned) Gergő Tisza: Improve sentry plugin [software/logstash/plugins] - https://gerrit.wikimedia.org/r/263027 (https://phabricator.wikimedia.org/T85239) (owner: Gergő Tisza)
[02:28:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 570 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:34:46] (PS1) Dzahn: openstack: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/627966 (https://phabricator.wikimedia.org/T209953)
[02:43:04] (PS1) Dzahn: openstack: replace remaining hiera() that had default values [puppet] - https://gerrit.wikimedia.org/r/627967
[02:44:08] (CR) jerkins-bot: [V: -1] openstack: replace remaining hiera() that had default values [puppet] - https://gerrit.wikimedia.org/r/627967 (owner: Dzahn)
[02:47:35] (PS1) Dzahn: spicerack: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/627968
[03:08:02] (PS1) Dzahn: nagios_common: delete unused contacts-new template [puppet] - https://gerrit.wikimedia.org/r/627969
[03:17:27] (PS1) Dzahn: docker: replace hiera with lookup, add data types for builder and registry [puppet] - https://gerrit.wikimedia.org/r/627970
[03:20:27] (PS1) Dzahn: parsoid: replace hiera with lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/627971
[04:39:19]
Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team-TODO, Traffic, Release-Engineering-Team (Other / Uncategorized): Investigate what caused the unattended varnish upgrade in Beta Cluster - https://phabricator.wikimedia.org/T179197 (DannyS712)
[04:53:43] !log Deploy schema change on s1 eqiad primary master - T238966
[04:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:51] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[04:58:20] marostegui do you have an estimate of when replag will be back to normal? I understand that the process takes a while, just wondering
[04:58:47] DannyS712: On all the hosts but 1 is back to normal, there might be lag again due to the above schema change
[04:58:53] maybe 1-2 days
[04:59:49] Okay. https://replag.toolforge.org/ says 65 hours and rising for enwiki, and its been noticed (had to explain that a bot report was running, there was just nothing different from the last report). Thanks
[05:00:31] DannyS712: yes, that is one of the hosts, which is still executing the big alter
[05:00:34] the others are in sync
[05:00:39] labsdb1009
[05:00:40] Seconds_Behind_Master: 0
[05:00:40] labsdb1010
[05:00:40] Seconds_Behind_Master: 0
[05:00:40] labsdb1011
[05:00:40] Seconds_Behind_Master: 120525
[05:00:41] labsdb1012
[05:00:41] Seconds_Behind_Master: 0
[05:01:45] ack, thanks
[05:02:03] (PS1) Marostegui: mariadb: Enable notifications on many hosts [puppet] - https://gerrit.wikimedia.org/r/627975
[05:02:43] (CR) Marostegui: [C: +2] mariadb: Enable notifications on many hosts [puppet] - https://gerrit.wikimedia.org/r/627975 (owner: Marostegui)
[05:05:13] Operations, ops-eqiad, DBA, DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (Marostegui) @Cmjohnson thanks for resetting the logs yesterday.
Unfortunately, the idrac still reports one of the power supplies as down. {P12612}
[05:07:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2015 after cloning es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12615 and previous config saved to /var/cache/conftool/dbconfig/20200917-050739-marostegui.json
[05:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:46] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:17:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2011 as es1 master and es2017 as es3 master and then depool es2018 and es2012 to clone es2029 and es2030 T261717', diff saved to https://phabricator.wikimedia.org/P12616 and previous config saved to /var/cache/conftool/dbconfig/20200917-051741-marostegui.json
[05:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:49] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:23:42] (PS1) Marostegui: mariadb: Productionize es2029, es2030 [puppet] - https://gerrit.wikimedia.org/r/627978 (https://phabricator.wikimedia.org/T261717)
[05:23:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2015 after cloning es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12617 and previous config saved to /var/cache/conftool/dbconfig/20200917-052347-marostegui.json
[05:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:54] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:25:45] (PS2) Marostegui: mariadb: Productionize es2029, es2030 [puppet] - https://gerrit.wikimedia.org/r/627978 (https://phabricator.wikimedia.org/T261717)
[05:27:29] (CR) Marostegui: [C: +2] mariadb: Productionize es2029, es2030 [puppet] - https://gerrit.wikimedia.org/r/627978
(https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2015 after cloning es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12618 and previous config saved to /var/cache/conftool/dbconfig/20200917-053503-marostegui.json
[05:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:11] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:36:48] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul I guess we should do that...and then wait again for the next crash and start the whole loop again. @Papaul let me know whe...
[05:38:12] (PS1) Marostegui: instances.yaml: Add es2031 [puppet] - https://gerrit.wikimedia.org/r/627979 (https://phabricator.wikimedia.org/T261717)
[05:39:14] (CR) Marostegui: [C: +2] instances.yaml: Add es2031 [puppet] - https://gerrit.wikimedia.org/r/627979 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:42:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2031 on es2 for the first time with minimal weight T261717', diff saved to https://phabricator.wikimedia.org/P12619 and previous config saved to /var/cache/conftool/dbconfig/20200917-054226-marostegui.json
[05:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:34] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:44:38] (PS1) Marostegui: db1131: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/627981
[05:45:38] (CR) Marostegui: [C: +2] db1131: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/627981 (owner: Marostegui)
[05:46:12] !log Stop mysql on db1131 - T262901
[05:46:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:18] T262901: Physically move db1131 from B5 to C8 - https://phabricator.wikimedia.org/T262901
[05:47:00] Operations, ops-eqiad, DBA, DC-Ops: Physically move db1131 from B5 to C8 - https://phabricator.wikimedia.org/T262901 (Marostegui) @Cmjohnson mysql is down on this host. I haven't powered it off yet, just to see if you can give me the new IP before I do so, so I can change it on the host so it boo...
[05:51:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 for on-site maintenace', diff saved to https://phabricator.wikimedia.org/P12620 and previous config saved to /var/cache/conftool/dbconfig/20200917-055158-marostegui.json
[05:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es2015 after cloning es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12621 and previous config saved to /var/cache/conftool/dbconfig/20200917-055219-marostegui.json
[05:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:26] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:03:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12622 and previous config saved to /var/cache/conftool/dbconfig/20200917-060312-marostegui.json
[06:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:19] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:19:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12623 and previous config saved to /var/cache/conftool/dbconfig/20200917-061931-marostegui.json
[06:19:37] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:39] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:27:57] PROBLEM - ores on ores2001 is CRITICAL: connect to address 10.192.0.12 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:43:23] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:47:09] Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (jcrespo) > That's a pretty big database and I do share Jaime's concerns. > I would definitely not store that on m2 as it already has `OTRS` which is around 500GB already. If we had to...
[06:49:36] Operations, observability, Goal: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (hashar)
[06:49:42] Operations, Gerrit, Release-Engineering-Team-TODO, observability, Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar)
[06:49:52] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, observability, and 3 others: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (hashar) Declined→Open Reopening, we need at least the JVM metrics to be exported so we c...
[06:55:05] !log Taking a heap dump of Gerrit JVM
[06:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:15] Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) >>! In T256538#6469302, @jcrespo wrote: >> That's a pretty big database and I do share Jaime's concerns. >> I would definitely not store that on m2 as it already has `OTRS...
[06:56:57] (CR) Volans: [C: +2] "Thanks for the patch, LGTM." [puppet] - https://gerrit.wikimedia.org/r/627968 (owner: Dzahn)
[06:58:36] Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (jcrespo) Yes, agree, once wikitech is moved, we can keep it and not introduce a new section, keep it for mailman, that will need significant IOPS and disk.
[07:01:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12624 and previous config saved to /var/cache/conftool/dbconfig/20200917-070145-marostegui.json
[07:01:47] (PS2) Volans: dns: split public zones per DC [software/netbox-extras] - https://gerrit.wikimedia.org/r/627899 (https://phabricator.wikimedia.org/T244153)
[07:01:49] (PS2) Volans: dns: correctly sort IPv6 PTR records [software/netbox-extras] - https://gerrit.wikimedia.org/r/627909 (https://phabricator.wikimedia.org/T244153)
[07:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:52] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:02:36] (CR) Volans: "updated comment."
(1 comment) [dns] - https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[07:33:09] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (Joe)
[07:39:42] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (Joe)
[07:40:59] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (Joe) p:Triage→Low Setting the priority to low as in production we set maxage to 5 minutes as far...
[07:42:25] (CR) Filippo Giunchedi: [C: +2] icinga: reload when nagios config changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627849 (https://phabricator.wikimedia.org/T263027) (owner: Filippo Giunchedi)
[07:48:25] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) My current theory is that when we enabled the service proxy, the pods were cpu starved. The steep increase in "throttled" cpu time w...
[07:53:29] (CR) JMeybohm: [C: +2] Use Kernel 4.19 on kubestage1002 [puppet] - https://gerrit.wikimedia.org/r/627867 (https://phabricator.wikimedia.org/T262527) (owner: JMeybohm)
[07:55:10] !log cordoning kubestage1002 for kernel upgrade - T262527
[07:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:18] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[07:55:26] !log draining kubestage1002 for kernel upgrade - T262527
[07:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:25] (CR) Ema: [C: +1] Add hue-next.wikimedia.org settings for ATS [puppet] - https://gerrit.wikimedia.org/r/627856 (https://phabricator.wikimedia.org/T258768) (owner: Elukey)
[08:06:21] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:06:35] (PS1) Matthias Mullie: Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/627955
[08:07:02] (CR) Elukey: [C: +2] Add hue-next.wikimedia.org settings for ATS [puppet] - https://gerrit.wikimedia.org/r/627856 (https://phabricator.wikimedia.org/T258768) (owner: Elukey)
[08:08:19] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:11:20] (PS1) Gilles: Account for empty layout shift sources array [extensions/NavigationTiming] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628039 (https://phabricator.wikimedia.org/T263047)
[08:11:48] (PS1) Filippo Giunchedi: icinga: notify on commands change [puppet] - https://gerrit.wikimedia.org/r/628040 (https://phabricator.wikimedia.org/T263027)
[08:17:05] (PS1) Giuseppe Lavagetto: wikifeeds: enable the service proxy in eqiad, raising cpu limits. [deployment-charts] - https://gerrit.wikimedia.org/r/628041 (https://phabricator.wikimedia.org/T263043)
[08:18:15] (CR) Filippo Giunchedi: [C: +2] hieradata: enable remote syslog queues in codfw [puppet] - https://gerrit.wikimedia.org/r/627816 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[08:24:57] !log graphite add 300G to /srv
[08:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:06] (PS1) Muehlenhoff: Change service_id to hue_next.w.o [puppet] - https://gerrit.wikimedia.org/r/628042
[08:25:40] !log reboot kubestage1002 for kernel upgrade - T262527
[08:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:46] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[08:29:58] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:40] (CR) Muehlenhoff: [C: +2] Change service_id to hue_next.w.o [puppet] - https://gerrit.wikimedia.org/r/628042 (owner: Muehlenhoff)
[08:31:32] (CR) Gilles: [C: +2] Account for empty layout shift sources array [extensions/NavigationTiming] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628039 (https://phabricator.wikimedia.org/T263047) (owner: Gilles)
[08:37:00] !log
graphite compress /var/log/carbon logs older than 2d
[08:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:17] RECOVERY - Disk space on graphite1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=graphite1004&var-datasource=eqiad+prometheus/ops
[08:43:53] !log uncordoned kubestage1002 after kernel upgrade - T262527
[08:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:59] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[08:49:48] !log deleting some random pods in kubernetes staging to rebalance load back on kubestage1002 - T262527
[08:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:55] T262527: Update to kernel 4.19 on kubernetes nodes - https://phabricator.wikimedia.org/T262527
[08:50:51] (Merged) jenkins-bot: Account for empty layout shift sources array [extensions/NavigationTiming] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628039 (https://phabricator.wikimedia.org/T263047) (owner: Gilles)
[08:52:48] (PS2) Volans: Migrate ulsfo private records to automated DNS [dns] - https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov)
[08:52:50] (PS1) Volans: Migrate ulsfo public records to automated DNS [dns] - https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729)
[08:53:12] (CR) jerkins-bot: [V: -1] Migrate ulsfo public records to automated DNS [dns] - https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729) (owner: Volans)
[08:58:14] (PS2) Volans: Migrate ulsfo public records to automated DNS [dns] -
https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729)
[08:58:35] expected failure on CI
[08:58:40] (CR) jerkins-bot: [V: -1] Migrate ulsfo public records to automated DNS [dns] - https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729) (owner: Volans)
[09:00:56] volans: don't we all
[09:01:11] lol
[09:03:25] (PS1) Alexandros Kosiaris: prometheus: Use FQDNS for k8s etcds [puppet] - https://gerrit.wikimedia.org/r/628049
[09:05:25] (PS2) Giuseppe Lavagetto: wikifeeds: enable the service proxy in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/628041 (https://phabricator.wikimedia.org/T263043)
[09:08:29] (CR) Giuseppe Lavagetto: [C: +2] wikifeeds: enable the service proxy in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/628041 (https://phabricator.wikimedia.org/T263043) (owner: Giuseppe Lavagetto)
[09:10:23] matthiasmullie: you're probably not going to be able to deploy your backport later on until this is fixed https://phabricator.wikimedia.org/T263047#6469550
[09:10:35] (Merged) jenkins-bot: wikifeeds: enable the service proxy in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/628041 (https://phabricator.wikimedia.org/T263043) (owner: Giuseppe Lavagetto)
[09:10:49] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/25150/" [puppet] - https://gerrit.wikimedia.org/r/627821 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[09:11:00] gilles: oh okay - thanks for the headsup!
[09:11:13] (CR) Jbond: [C: +2] cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[09:11:21] (PS37) Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - https://gerrit.wikimedia.org/r/627272 (https://phabricator.wikimedia.org/T246890)
[09:13:04] (PS1) Ayounsi: Move HE to transits on cr3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/628050
[09:13:13] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article
[09:13:13] dia returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:13:38] (CR) Ayounsi: [C: +2] Move HE to transits on cr3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/628050 (owner: Ayounsi)
[09:14:06] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[09:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:13] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:14:19] (Merged) jenkins-bot: Move HE to transits on cr3-eqsin [homer/public] - https://gerrit.wikimedia.org/r/628050 (owner: Ayounsi)
[09:15:09] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:15:49] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) In the meantime, I changed focus given even in eqiad we were still seeing failures with just some monitoring traffic.
[09:16:09] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:19:24] Operations, Wikimedia-Mailing-lists: Disabling the Daily-article-hy mailing list - https://phabricator.wikimedia.org/T263105 (Ashot1997)
[09:21:43] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627966 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[09:25:35] (PS6) Jbond: cumin: for new wmcs.
prefix for cookbooks, grant access to wmcs-admins [puppet] - https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[09:40:22] (PS4) Muehlenhoff: reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597
[09:43:28] (CR) Vgutierrez: [C: +2] Handle new pylint raise-missing-from [software/acme-chief] - https://gerrit.wikimedia.org/r/627813 (owner: Vgutierrez)
[09:43:34] (CR) Vgutierrez: [C: +2] x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez)
[09:43:37] (CR) Vgutierrez: [C: +2] api: Allow acme-chief clients to fetch alt. chain cert versions [software/acme-chief] - https://gerrit.wikimedia.org/r/627869 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez)
[09:43:40] (CR) Vgutierrez: [C: +2] requests: Fetch alternative chains [software/acme-chief] - https://gerrit.wikimedia.org/r/627823 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez)
[09:43:43] (CR) Vgutierrez: [C: +2] acme_chief: Save alternative chains [software/acme-chief] - https://gerrit.wikimedia.org/r/627835 (owner: Vgutierrez)
[09:43:46] (CR) Vgutierrez: [C: +2] Release 0.29 [software/acme-chief] - https://gerrit.wikimedia.org/r/627873 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez)
[09:45:47] (CR) Lucas Werkmeister (WMDE): "This change is ready for review."
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: Guergana Tzatchkova)
[09:46:28] (Merged) jenkins-bot: Handle new pylint raise-missing-from [software/acme-chief] - https://gerrit.wikimedia.org/r/627813 (owner: Vgutierrez)
[09:46:32] (Merged) jenkins-bot: x509: Alternative chain support [software/acme-chief] - https://gerrit.wikimedia.org/r/627809 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez)
[09:46:41] (Merged) jenkins-bot: requests: Fetch alternative chains [software/acme-chief] - https://gerrit.wikimedia.org/r/627823 (https://phabricator.wikimedia.org/T263006) (owner: Vgutierrez)
[09:46:42] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) It seems we struck gold: after adding a keepalive timeout of 4 seconds to `restbase-for-services`, the errors on wikifeeds in eqiad...
[09:46:47] (Merged) jenkins-bot: acme_chief: Save alternative chains [software/acme-chief] - https://gerrit.wikimedia.org/r/627835 (owner: Vgutierrez)
[09:46:49] (Merged) jenkins-bot: api: Allow acme-chief clients to fetch alt.
chain cert versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627869 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [09:47:32] (03PS1) 10Giuseppe Lavagetto: services_proxy: add a 4s keepalive timeout to restbase endpoint [puppet] - 10https://gerrit.wikimedia.org/r/628051 (https://phabricator.wikimedia.org/T263043) [09:48:24] (03PS1) 10Vgutierrez: Handle new pylint raise-missing-from [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628052 [09:48:26] (03PS1) 10Vgutierrez: x509: Alternative chain support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628053 (https://phabricator.wikimedia.org/T263006) [09:48:28] (03PS1) 10Vgutierrez: requests: Fetch alternative chains [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628054 (https://phabricator.wikimedia.org/T263006) [09:48:30] (03PS1) 10Vgutierrez: acme_chief: Save alternative chains [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628055 [09:48:32] (03PS1) 10Vgutierrez: api: Allow acme-chief clients to fetch alt. 
chain cert versions [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628056 (https://phabricator.wikimedia.org/T263006) [09:48:57] (03Merged) 10jenkins-bot: Release 0.29 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/627873 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [09:51:43] (03PS1) 10Vgutierrez: Release 0.29 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628057 (https://phabricator.wikimedia.org/T263006) [09:51:49] (03CR) 10JMeybohm: [C: 03+1] services_proxy: add a 4s keepalive timeout to restbase endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628051 (https://phabricator.wikimedia.org/T263043) (owner: 10Giuseppe Lavagetto) [09:55:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] prometheus: Use FQDNS for k8s etcds [puppet] - 10https://gerrit.wikimedia.org/r/628049 (owner: 10Alexandros Kosiaris) [09:59:41] (03CR) 10Ladsgroup: Remove $wgExtraLanguageNames from Wikidata and Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1000) [10:00:26] (03PS2) 10Giuseppe Lavagetto: services_proxy: add a 4s keepalive timeout to restbase endpoint [puppet] - 10https://gerrit.wikimedia.org/r/628051 (https://phabricator.wikimedia.org/T263043) [10:02:49] (03CR) 10JMeybohm: [C: 03+1] services_proxy: add a 4s keepalive timeout to restbase endpoint [puppet] - 10https://gerrit.wikimedia.org/r/628051 (https://phabricator.wikimedia.org/T263043) (owner: 10Giuseppe Lavagetto) [10:04:18] (03CR) 10MSantos: [C: 03+1] services_proxy: add a 4s keepalive timeout to restbase endpoint [puppet] - 10https://gerrit.wikimedia.org/r/628051 (https://phabricator.wikimedia.org/T263043) (owner: 10Giuseppe Lavagetto) [10:14:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: add a 4s keepalive timeout to restbase endpoint [puppet] - 10https://gerrit.wikimedia.org/r/628051 (https://phabricator.wikimedia.org/T263043) (owner: 10Giuseppe Lavagetto) [10:15:07] (03CR) 10Kormat: [C: 03+1] "LGTM" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 (owner: 10Jcrespo) [10:17:26] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [10:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:11] (03PS3) 10Volans: dns: split public zones per DC [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627899 (https://phabricator.wikimedia.org/T244153) [10:18:12] (03PS3) 10Volans: dns: correctly sort IPv6 PTR records [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627909 (https://phabricator.wikimedia.org/T244153) [10:18:14] (03PS1) 10Volans: dns: do not try to generate PTR for external IPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/628061 (https://phabricator.wikimedia.org/T244153) [10:18:19] (03CR) 10Kormat: [C: 03+1] resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 (owner: 10Jcrespo) [10:18:42] (03CR) 10jerkins-bot: [V: 04-1] dns: do not try to generate PTR for external IPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/628061 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [10:18:49] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [10:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:58] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [10:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:41] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [10:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:50] (03PS2) 10Volans: dns: do not try to generate PTR for external IPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/628061 (https://phabricator.wikimedia.org/T244153) [10:27:28] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [10:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:57] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[10:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:16] (03CR) 10Volans: "CI failure is expected because depends on I24887226bd30e3a1a8d5622179ac695c083b992a being merged and the sre.dns.netbox cookbook run to ge" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [10:39:30] (03PS1) 10Elukey: cdh::hue: deploy libsasl2-modules-gssapi-mit [puppet] - 10https://gerrit.wikimedia.org/r/628063 [10:40:37] !log oblivian@cumin1001 conftool action : set/ttl=10; selector: dnsdisc=wikifeeds [10:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628063 (owner: 10Elukey) [10:41:53] (03CR) 10Elukey: [C: 03+2] cdh::hue: deploy libsasl2-modules-gssapi-mit [puppet] - 10https://gerrit.wikimedia.org/r/628063 (owner: 10Elukey) [10:42:18] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops, 10Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Joe) Current situation: eqiad and staging use the service proxy, and after the fix was deployed show no signs of er... 
[10:46:05] (03PS1) 10Santhosh: wgSkipSkins: Exclude contenttranslation skin from skin options for users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) [10:46:23] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wikifeeds,name=eqiad [10:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12626 and previous config saved to /var/cache/conftool/dbconfig/20200917-104816-marostegui.json [10:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:23] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [10:50:22] (03CR) 10Vgutierrez: [C: 03+2] Release 0.29 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628057 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:51:16] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wikifeeds,name=codfw [10:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:30] (03PS1) 10Hnowlan: api-gateway: Add all mwdebug hosts to debug clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/628066 (https://phabricator.wikimedia.org/T262396) [10:52:49] (03CR) 10Vgutierrez: [C: 03+2] Handle new pylint raise-missing-from [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628052 (owner: 10Vgutierrez) [10:52:52] (03CR) 10Vgutierrez: [C: 03+2] x509: Alternative chain support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628053 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:52:55] (03CR) 10Vgutierrez: [C: 03+2] requests: Fetch alternative chains [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628054 (https://phabricator.wikimedia.org/T263006) 
(owner: 10Vgutierrez) [10:52:58] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Save alternative chains [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628055 (owner: 10Vgutierrez) [10:53:02] (03CR) 10Vgutierrez: [C: 03+2] api: Allow acme-chief clients to fetch alt. chain cert versions [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628056 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:53:21] (03PS1) 10Vgutierrez: debian: Add release 0.29 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628067 (https://phabricator.wikimedia.org/T263006) [10:54:03] RECOVERY - hue-next.wikimedia.org requires authentication on an-tool1009 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 514 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:55:02] yesss [10:55:50] (03Merged) 10jenkins-bot: Handle new pylint raise-missing-from [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628052 (owner: 10Vgutierrez) [10:55:52] (03Merged) 10jenkins-bot: x509: Alternative chain support [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628053 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:55:54] (03Merged) 10jenkins-bot: requests: Fetch alternative chains [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628054 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:55:56] (03Merged) 10jenkins-bot: acme_chief: Save alternative chains [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628055 (owner: 10Vgutierrez) [10:56:05] (03Merged) 10jenkins-bot: api: Allow acme-chief clients to fetch alt. 
chain cert versions [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628056 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:56:07] (03Merged) 10jenkins-bot: Release 0.29 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628057 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:58:03] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.29 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628067 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [10:58:04] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wikifeeds,name=codfw [10:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:26] !log Stop mysql on db1125 for PDU maintenance, lag will appear on s2, s4, s6 and s7 on labsdb hosts T261459 [10:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:32] T261459: New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1100). [11:00:04] matthiasmullie: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:21] I don’t see anything in the calendar [11:00:44] oh, it was moved to the puppet request window [11:00:48] and jouncebot didn’t refresh in time [11:00:51] (03PS1) 10Muehlenhoff: Apply CAS auth settings for the entire vhost [puppet] - 10https://gerrit.wikimedia.org/r/628068 [11:00:53] (T243394) [11:00:54] T243394: Automatically refresh jouncebot just before a deployment window starts - https://phabricator.wikimedia.org/T243394 [11:00:55] (03Merged) 10jenkins-bot: debian: Add release 0.29 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/628067 (https://phabricator.wikimedia.org/T263006) (owner: 10Vgutierrez) [11:01:01] I moved it to this evening because of https://phabricator.wikimedia.org/T263047#6469550 - unless that's resolved? [11:01:09] let me check [11:01:11] oh, moved it to the wrong slot - should be the other backports window :p [11:01:41] yup, the lockfile still exists [11:02:07] does anyone know more about this lockfile? (cc krinkle if he’s online) [11:02:33] I don't know more about that, but I SMS'd Krinkle to see if he's awake [11:02:34] * matthiasmullie moved patch to correct window this eve [11:02:43] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path /mostread/articles = Expected 1 array elements, gotten 0 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:03:34] thanks liw [11:03:35] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wikifeeds,name=eqiad [11:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:39] timestamp would match https://sal.toolforge.org/log/KcuomXQBLkHzneNNCjM7 / T254646#6468972 [11:03:40] T254646: Reconsidering how we name things - https://phabricator.wikimedia.org/T254646 
[11:03:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 (10Marostegui) mysql stopped on db1125 on all the instances [11:04:11] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:25] !log upload acme-chief 0.29 to apt.wm.o (buster) - T263006 [11:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:31] T263006: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 [11:04:41] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:04:48] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:58] Lucas_WMDE: which lock file? 
[11:05:08] `/var/lock/scap-global-lock` on deploy1001 [11:05:12] which apparently blocks scaps [11:05:16] yeah on purpose [11:05:28] someone must have set it earlier [11:05:46] yes, twelve hours ago according to the file modification time… [11:05:52] (03CR) 10Elukey: Apply CAS auth settings for the entire vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628068 (owner: 10Muehlenhoff) [11:06:00] I guess we can just drop it [11:06:12] !log update to acme-chief 0.29 on acmechief[12]001 - T263006 [11:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:58] if you can, that would be great – I don’t think I have the needed rights (root, if I’m not mistaken, since that’s the owner of the containing directory) [11:07:07] hashar, it would be good to unblock scap; but it would also be good to know why the lock is there [11:07:20] (see also https://phabricator.wikimedia.org/T263047#6469550) [11:07:55] hashar, I'm leaning towards deleting it and fixing what breaks, if you're around, though [11:08:27] * liw pops out for a minute [11:09:39] needs SRE to delete it :-\ [11:10:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es2031 T261717', diff saved to https://phabricator.wikimedia.org/P12627 and previous config saved to /var/cache/conftool/dbconfig/20200917-111028-marostegui.json [11:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:35] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [11:11:41] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (10Vgutierrez) p:05High→03Medium acme-chief updated to version 0.29 in our production environment, the unified cert should be renewed tomorrow, we will check... 
[11:12:03] Lucas_WMDE: liw: I poked SRE on the back channel [11:12:20] thanks [11:12:40] (fwiw, I have a meeting in fifteen minutes, so I probably won’t be able to deploy the config change anyways even if scap gets unblocked) [11:12:54] Lucas_WMDE: it is good now ;] [11:12:57] thanks to volans ! [11:13:38] matthiasmullie: want to go ahead with the deployment or wait until the next slot? [11:13:43] thanks volans and hashar! [11:13:52] might as well get it done now :) [11:14:07] (03CR) 10Matthias Mullie: [C: 03+2] Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627955 (owner: 10Matthias Mullie) [11:14:11] afk doorbell [11:14:29] (03PS3) 10Hnowlan: changeprop: remove changeprop from puppet [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) [11:14:41] back [11:14:46] looks like CI on that repo is fast so let’s gooo [11:14:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627955 (owner: 10Matthias Mullie) [11:15:33] Lucas_WMDE: are you deploying or am I? [11:15:56] I don't care either way - just making sure we don't do duplicate work :D [11:16:06] if you want to do it go ahead [11:16:14] I wasn’t sure if you’re a deployer or not ^^ [11:16:44] ah yeah - thanks for that ;) [11:16:49] I'll go ahead [11:17:37] (03Merged) 10jenkins-bot: Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.9) - 10https://gerrit.wikimedia.org/r/627955 (owner: 10Matthias Mullie) [11:18:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (10ayounsi) [11:18:33] btw - has someone backported this one yet, now that the lockfile is gone? https://phabricator.wikimedia.org/T263047#6469550 [11:18:49] s/backported/synced/ [11:19:17] no, good point [11:19:52] can you do both? 
[11:21:01] sure [11:21:38] 10Blocked-on-Operations, 10Puppet, 10Product-Infrastructure-Team-Backlog, 10Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956 (10Nikerabbit) [11:22:07] !log mlitn@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/WikimediaEvents/: Disable MediaSearch A/B test (duration: 01m 08s) [11:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:20] thanks [11:24:09] !log mlitn@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/NavigationTiming/: Account for empty layout shift sources array (duration: 01m 05s) [11:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:32] !log End Euro B&C [11:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:59] (03CR) 10Volans: "The patch seems reasonable to me, couple of minor comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [11:26:13] hashar, volans, thanks for unblocking Scap [11:29:29] (03CR) 10Muehlenhoff: Apply CAS auth settings for the entire vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628068 (owner: 10Muehlenhoff) [11:29:41] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/25152/" [puppet] - 10https://gerrit.wikimedia.org/r/628068 (owner: 10Muehlenhoff) [11:38:03] (03CR) 10Hnowlan: "pcc output: https://puppet-compiler.wmflabs.org/compiler1003/25153/scb1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [11:44:03] (03CR) 10Muehlenhoff: [C: 03+2] Add grafana-rw to CORS origins [puppet] - 10https://gerrit.wikimedia.org/r/627777 (owner: 10Muehlenhoff) [11:46:49] gilles, so your patch for T263047 got deployed - can you confirm we can drop it as a train blocker? 
[11:46:49] T263047: Uncaught TypeError: Cannot read property 'node' of undefined - https://phabricator.wikimedia.org/T263047 [11:52:56] (03CR) 10Elukey: [C: 03+1] Apply CAS auth settings for the entire vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628068 (owner: 10Muehlenhoff) [11:57:13] (03CR) 10Hnowlan: [C: 03+2] changeprop: remove changeprop from puppet [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [11:57:26] liw: I'm checking right now [11:58:52] liw: fix confirmed https://phabricator.wikimedia.org/T263047#6470033 [11:59:45] gilles, merci boucoup! [11:59:57] (which I hope means "thank you") [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1200) [12:12:36] (03CR) 10Jcrespo: "@kormat: Thanks for the review. My approach to wmfmariadbpy was thinking you would merge when it was adequate (you do the +2 as owner, as " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 (owner: 10Jcrespo) [12:14:13] (03PS2) 10Hnowlan: api-gateway: Add all mwdebug hosts to debug clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/628066 (https://phabricator.wikimedia.org/T262396) [12:18:35] !log pdu swap maintenance beginning now for racks D1, D2 and C1 eqiad [12:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:40] D1: Initial commit - https://phabricator.wikimedia.org/D1 [12:18:41] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [12:27:06] (03CR) 10Muehlenhoff: [C: 03+2] Apply CAS auth settings for the entire vhost [puppet] - 10https://gerrit.wikimedia.org/r/628068 (owner: 10Muehlenhoff) [12:31:52] (03PS6) 10Hnowlan: api-gateway: migrate to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 [12:33:34] PROBLEM - ps1-d1-eqiad-infeed-load-tower-B-phase-Z on ps1-d1-eqiad is CRITICAL: SNMP 
CRITICAL - ps1-d1-eqiad-infeed-load-tower-B-phase-Z *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:07] (03PS1) 10Jbond: stunnel: Add new stunnel class and daemon define [puppet] - 10https://gerrit.wikimedia.org/r/628078 [12:34:38] (03CR) 10jerkins-bot: [V: 04-1] stunnel: Add new stunnel class and daemon define [puppet] - 10https://gerrit.wikimedia.org/r/628078 (owner: 10Jbond) [12:39:36] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:40:40] (03CR) 10Elukey: [C: 03+2] sre.cassandra.roll-restart.py: add more accurate sleep time [cookbooks] - 10https://gerrit.wikimedia.org/r/627512 (owner: 10Elukey) [12:40:46] (03PS3) 10Elukey: sre.cassandra.roll-restart.py: add more accurate sleep time [cookbooks] - 10https://gerrit.wikimedia.org/r/627512 [12:47:51] (03PS2) 10Jbond: stunnel: Add new stunnel class and daemon define [puppet] - 10https://gerrit.wikimedia.org/r/628078 [12:48:12] (03PS3) 10Rosalie Perside (WMDE): Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [12:48:27] (03CR) 10jerkins-bot: [V: 04-1] stunnel: Add new stunnel class and daemon define [puppet] - 10https://gerrit.wikimedia.org/r/628078 (owner: 10Jbond) [12:49:04] (03CR) 10jerkins-bot: [V: 04-1] Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [12:50:13] (03PS4) 10JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627266 (https://phabricator.wikimedia.org/T255876) [12:51:00] (03PS3) 10Jbond: stunnel: Add new stunnel class and daemon define [puppet] - 10https://gerrit.wikimedia.org/r/628078 (https://phabricator.wikimedia.org/T259117) [12:52:00] (03PS10) 10Ottomata: helmfile.d: refactor 
eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [12:53:52] PROBLEM - IPMI Sensor Status on mw1351 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:54:27] (03CR) 10Ottomata: [C: 03+2] helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [12:54:40] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [12:55:11] (03CR) 10Kormat: [C: 03+2] mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/627502 (owner: 10Jcrespo) [12:55:41] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove mobileapps non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627266 (https://phabricator.wikimedia.org/T255876) (owner: 10JMeybohm) [12:55:46] PROBLEM - IPMI Sensor Status on mw1362 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:57:07] (03Merged) 10jenkins-bot: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [12:57:38] (03PS4) 10Rosalie Perside (WMDE): Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [13:00:05] liw and brennen: I, the Bot under the Fountain, allow thee, The 
Deployer, to do Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1300). [13:00:33] (03PS1) 10Lars Wirzenius: all wikis to 1.36.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628081 [13:00:35] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.36.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628081 (owner: 10Lars Wirzenius) [13:00:40] (03PS5) 10Rosalie Perside (WMDE): Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [13:01:09] (03PS2) 10JMeybohm: lvs: Remove blubberoid non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627270 (https://phabricator.wikimedia.org/T236017) [13:01:37] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628081 (owner: 10Lars Wirzenius) [13:02:59] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove blubberoid non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627270 (https://phabricator.wikimedia.org/T236017) (owner: 10JMeybohm) [13:03:40] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.9 [13:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:19] (03CR) 10Kormat: [C: 03+2] mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 (owner: 10Jcrespo) [13:05:22] (03CR) 10Kormat: [C: 03+2] resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 (owner: 10Jcrespo) [13:06:30] (03Merged) 10jenkins-bot: mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 (owner: 10Jcrespo) [13:06:32] (03Merged) 10jenkins-bot: resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 (owner: 10Jcrespo) [13:06:34] (03Merged) 10jenkins-bot: mariadb: Use labsdb mysql config group for both labsdb and clouddb hosts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/627502 (owner: 10Jcrespo) [13:07:02] (03CR) 10Ppchelko: [C: 03+1] api-gateway: Add all mwdebug hosts to debug clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/628066 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [13:07:20] (03PS1) 10Ottomata: eventgate-main - use proper path to private values file in helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/628082 (https://phabricator.wikimedia.org/T258572) [13:08:24] (03CR) 10JMeybohm: [C: 03+1] eventgate-main - use proper path to private values file in helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/628082 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [13:08:40] PROBLEM - Host ms-be1059 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:44] (03CR) 10Ottomata: [C: 03+2] eventgate-main - use proper path to private values file in helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/628082 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [13:08:50] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-main - use proper path to private values file in helmfile.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/628082 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [13:08:58] cmjohnson1: expected? 
ms-be1059 down [13:09:12] cc godog [13:09:57] (03CR) 10Lucas Werkmeister (WMDE): Remove $wgExtraLanguageNames from Wikidata and Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [13:10:34] Volans not expected but both power supplies on same pdu [13:10:40] Powering up now [13:10:42] 10Operations, 10DBA: Remove sections from db configs - https://phabricator.wikimedia.org/T263127 (10Kormat) [13:10:46] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.31:8748, 10.2.1.14:8888]) https://wikitech.wikimedia.org/wiki/PyBal [13:11:19] 10Operations, 10DBA, 10User-Kormat: Remove sections from db configs - https://phabricator.wikimedia.org/T263127 (10Kormat) p:05Triage→03Medium [13:11:40] PROBLEM - ps1-d2-eqiad-infeed-load-tower-B-phase-X on ps1-d2-eqiad is CRITICAL: SNMP CRITICAL - ps1-d2-eqiad-infeed-load-tower-B-phase-X *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:04] PROBLEM - ps1-d2-eqiad-infeed-load-tower-B-phase-Y on ps1-d2-eqiad is CRITICAL: SNMP CRITICAL - ps1-d2-eqiad-infeed-load-tower-B-phase-Y *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:22] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.14:8888, 10.2.2.31:8748]) https://wikitech.wikimedia.org/wiki/PyBal [13:12:30] PROBLEM - ps1-d2-eqiad-infeed-load-tower-B-phase-Z on ps1-d2-eqiad is CRITICAL: SNMP CRITICAL - ps1-d2-eqiad-infeed-load-tower-B-phase-Z *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:32] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.31:8748, 10.2.1.14:8888]) https://wikitech.wikimedia.org/wiki/PyBal [13:12:39] me ^ [13:12:44] PROBLEM - PyBal IPVS 
diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.14:8888, 10.2.2.31:8748]) https://wikitech.wikimedia.org/wiki/PyBal [13:13:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:29] !log restarting pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet [13:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:56] RECOVERY - Host ms-be1059 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [13:14:43] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:49] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:36] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui you can at anytime. 
Thanks [13:16:44] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul doing it now, thanks - will ping you once it is ready for you [13:17:15] !log Stop MySQL on db2125 for on-site maintenance T260670 [13:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:21] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [13:17:45] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul host off - you can go ahead [13:19:02] !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet [13:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:19] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:20:19] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[13:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:59] RECOVERY - IPMI Sensor Status on mw1351 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:24:24] (PS1) Elukey: Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) [13:25:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Start depooling db2114 T259831', diff saved to https://phabricator.wikimedia.org/P12628 and previous config saved to /var/cache/conftool/dbconfig/20200917-132513-kormat.json [13:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:25:57] RECOVERY - IPMI Sensor Status on mw1362 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:26:33] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:26:46] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:27:55] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:28:07] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:28:23] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:29:36] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, 
serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) After further review - while the change I made fixed the error messages, it did not fix the final restbase res... [13:31:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:32:14] !log ran ipvsadm -D -t 10.2.1.14:8888 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet [13:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:29] !log ran ipvsadm -D -t 10.2.1.31:8748 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet [13:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:59] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Nemo_bis) >>! In T262869#6468339, @CDanis wrote: > There was another instance of it about 10 hours ago. Also right now it seems, at least from some TIM customer in... [13:32:59] !log ran ipvsadm -D -t 10.2.2.31:8748 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet [13:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:37] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:33:37] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[13:33:39] !log ran ipvsadm -D -t 10.2.2.14:8888 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet [13:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:22] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) >>! In T262869#6470378, @Nemo_bis wrote: >>>! In T262869#6468339, @CDanis wrote: >> There was another instance of it about 10 hours ago. > > Also right now... [13:35:18] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Pchelolo) @Joe absolutely correct. Restbase just calls that url and fetches the summary (from itself, so that shoul... [13:36:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 184 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:37:53] (PS1) Volans: swift: remove old unused service records [dns] - https://gerrit.wikimedia.org/r/628086 (https://phabricator.wikimedia.org/T244153) [13:38:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:44:52] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Gehel) >>! 
In T260271#6463408, @Papaul wrote: > @RKemper the installing is not able to setup the raid using the partman recipe below . We are using a HW r... [13:46:45] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) upgrade BIOS from BIOS Version 2.8.1 to 2.8.2 changed profile settings from to Performace @Marostegui all... [13:50:39] (PS1) Gehel: maps: add partman configuration for newer maps servers. [puppet] - https://gerrit.wikimedia.org/r/628089 (https://phabricator.wikimedia.org/T260271) [13:51:36] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (JMeybohm) [13:51:44] Operations, DBA, Performance-Team, Platform Engineering, User-Kormat: Remove sections from db configs - https://phabricator.wikimedia.org/T263127 (Marostegui) So in the past, the hosts serving those 5 groups used to have different schema partitioning (different indexes, PKs and even partition... [13:52:06] Operations, serviceops, Kubernetes, Patch-For-Review, Release Pipeline (Blubber): Move blubberoid to use TLS only. 
- https://phabricator.wikimedia.org/T236017 (JMeybohm) [13:52:28] (PS1) Filippo Giunchedi: am: use status.cgi JSON as source for problems [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/628090 [13:52:36] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm) [13:52:55] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) Thank you @Papaul - I will start mysql and start slowing repooling the host in production. [13:53:20] Operations, DBA, Performance-Team, Platform Engineering, User-Kormat: Remove sections from db configs - https://phabricator.wikimedia.org/T263127 (Kormat) [13:53:24] (PS2) Filippo Giunchedi: am: use status.cgi JSON as source for problems [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/628090 [13:54:43] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) Aha. Nice catch. So, @Pchelolo, it sounds like the fix here is to collect and assemble full page summa... [13:55:28] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Pchelolo) >>! In T263043#6470484, @Mholloway wrote: > Aha. Nice catch. So, @Pchelolo, it sounds like the fix here... 
[13:55:32] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) So for each request for a feed, we do the following: - we call restbase - restbase calls wikifeeds - wikifeed... [13:55:41] (PS1) Marostegui: db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/628091 (https://phabricator.wikimedia.org/T260670) [13:56:06] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:56:13] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:57:40] kormat: ^ maybe that's you? [13:57:44] oops, sigh, yes. [13:58:06] (CR) Filippo Giunchedi: [C: +1] "LGTM, cross-checked via code search:" [dns] - https://gerrit.wikimedia.org/r/628086 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [13:58:08] fixed, thanks. [13:58:22] <3 [13:59:05] (CR) Marostegui: [C: +2] db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/628091 (https://phabricator.wikimedia.org/T260670) (owner: Marostegui) [13:59:19] (CR) Filippo Giunchedi: "Strawdog version, won't work out of the box as we need an additional vhost" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/628090 (owner: Filippo Giunchedi) [14:00:03] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) >>! 
In T263043#6470490, @Joe wrote: > - restbase calls aqs and other stuff Slight correction: it's actua... [14:00:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12629 and previous config saved to /var/cache/conftool/dbconfig/20200917-140014-marostegui.json [14:00:17] (PS2) Elukey: Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) [14:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [14:01:17] (CR) jerkins-bot: [V: -1] Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) (owner: Elukey) [14:01:17] Operations, Product-Infrastructure-Team-Backlog, RESTBase, Wikifeeds, serviceops: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (Mholloway) [14:01:30] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:01:36] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:02:20] !log Start mysql on db1125 after PDU maintenance T261459 [14:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:26] T261459: New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 [14:02:52] Operations, ops-eqiad, DC-Ops: New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - 
https://phabricator.wikimedia.org/T261459 (Marostegui) mysql started back on db1125 [14:05:53] PROBLEM - Juniper alarms on fasw-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:06:02] (PS3) Elukey: Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) [14:06:26] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) >>! In T263043#6470502, @Mholloway wrote: >>>! In T263043#6470490, @Joe wrote: >> - restbase calls aqs and oth... [14:07:03] (CR) jerkins-bot: [V: -1] Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) (owner: Elukey) [14:07:50] (PS1) Andrew Bogott: toolforge spreadcheck: don't page! [puppet] - https://gerrit.wikimedia.org/r/628094 [14:09:34] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Nemo_bis) Yes, we're telling that to everybody (including to journalists who called WMIT, social media, internal mailing lists and colleagues). Did you get any info... 
[14:10:32] (CR) Hnowlan: [C: +2] api-gateway: Add all mwdebug hosts to debug clusters [deployment-charts] - https://gerrit.wikimedia.org/r/628066 (https://phabricator.wikimedia.org/T262396) (owner: Hnowlan) [14:10:38] RECOVERY - Juniper alarms on fasw-c-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:11:11] (PS1) Filippo Giunchedi: hieradata: bump swift object replicator concurrency [puppet] - https://gerrit.wikimedia.org/r/628095 (https://phabricator.wikimedia.org/T261633) [14:11:49] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (Pchelolo) [14:12:01] Operations, Platform Engineering, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (Pchelolo) a:Joe→None [14:12:38] (Merged) jenkins-bot: api-gateway: Add all mwdebug hosts to debug clusters [deployment-charts] - https://gerrit.wikimedia.org/r/628066 (https://phabricator.wikimedia.org/T262396) (owner: Hnowlan) [14:13:03] (PS1) Muehlenhoff: grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) [14:13:03] Operations, Platform Engineering, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (Pchelolo) [14:13:08] (CR) Andrew Bogott: [C: +2] toolforge spreadcheck: don't page! 
[puppet] - https://gerrit.wikimedia.org/r/628094 (owner: Andrew Bogott) [14:13:26] (CR) jerkins-bot: [V: -1] grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [14:15:24] (PS2) Muehlenhoff: grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) [14:15:34] Operations, Analytics, Event-Platform, Wikimedia-production-error: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount - https://phabricator.wikimedia.org/T263132 (dcausse) https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now shows a... [14:15:48] (CR) jerkins-bot: [V: -1] grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [14:18:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12630 and previous config saved to /var/cache/conftool/dbconfig/20200917-141825-marostegui.json [14:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:34] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [14:18:46] (PS1) Jbond: pki: add client auth kety [labs/private] - https://gerrit.wikimedia.org/r/628097 [14:18:53] (PS3) Muehlenhoff: grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) [14:23:16] PROBLEM - Host ps1-c1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:26:31] (PS4) Jbond: stunnel: Add new stunnel class and daemon define [puppet] - https://gerrit.wikimedia.org/r/628078 
(https://phabricator.wikimedia.org/T259117) [14:26:33] (PS1) Jbond: pki::client: add class to install client pki tools [puppet] - https://gerrit.wikimedia.org/r/628098 (https://phabricator.wikimedia.org/T259117) [14:28:55] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:21] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:15] !log replacing msw-d1,d2,d3,d4,d5 and d6 [14:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:41] (PS2) Jbond: pki::client: add class to install client pki tools [puppet] - https://gerrit.wikimedia.org/r/628098 (https://phabricator.wikimedia.org/T259117) [14:33:27] (CR) Filippo Giunchedi: [C: +1] grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [14:35:26] PROBLEM - Host db2100.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:35:30] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:35:36] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.56 ms [14:38:28] PROBLEM - Host db2084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:10] (CR) Jbond: [V: +2 C: +2] pki: add client auth kety [labs/private] - https://gerrit.wikimedia.org/r/628097 (owner: Jbond) [14:39:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12631 and previous config saved to /var/cache/conftool/dbconfig/20200917-143914-marostegui.json [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:22] 
T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [14:40:22] (PS1) Volans: urldownloader: convert A record to CNAME [dns] - https://gerrit.wikimedia.org/r/628102 (https://phabricator.wikimedia.org/T244153) [14:40:30] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:45] (CR) Muehlenhoff: [C: +2] grafana.discovery.wmnet.crt: Add grafana-rw.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/628096 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [14:41:10] RECOVERY - Host db2100.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.63 ms [14:43:33] (CR) Muehlenhoff: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/628102 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [14:43:36] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (Pchelolo) a:Pchelolo [14:46:46] PROBLEM - Host db2101.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:46:47] (PS1) Herron: prometheus: add pop hosts to prometheus_all_nodes, set replica_label [puppet] - https://gerrit.wikimedia.org/r/628104 (https://phabricator.wikimedia.org/T243057) [14:47:23] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, serviceops, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Mholloway) My mistake. Wikifeeds calls RESTBase which calls AQS. 
[14:47:33] PROBLEM - Host kubernetes2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:47:33] PROBLEM - Host kubernetes2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:47:43] PROBLEM - Host ps1-d6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:48:13] PROBLEM - Host db2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:48:13] PROBLEM - Host dbproxy2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:07] PROBLEM - Host db2074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:13] PROBLEM - Host db2130.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:13] PROBLEM - Host es2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:17] !log ending pdu maintenance in eqiad [14:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] (PS3) Jbond: pki::client: add class to install client pki tools [puppet] - https://gerrit.wikimedia.org/r/628098 (https://phabricator.wikimedia.org/T259117) [14:50:53] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.63 ms [14:52:01] (PS4) Elukey: Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) [14:52:03] RECOVERY - Host db2074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [14:52:05] RECOVERY - Host db2101.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [14:52:39] RECOVERY - Host kubernetes2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.98 ms [14:52:39] RECOVERY - Host kubernetes2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.00 ms [14:53:03] (CR) jerkins-bot: [V: -1] Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) (owner: Elukey) [14:53:21] (CR) Filippo Giunchedi: [C: +1] prometheus: add pop hosts to prometheus_all_nodes, set replica_label [puppet] - 
https://gerrit.wikimedia.org/r/628104 (https://phabricator.wikimedia.org/T243057) (owner: Herron) [14:53:27] RECOVERY - Host db2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.06 ms [14:53:27] RECOVERY - Host dbproxy2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.79 ms [14:54:05] PROBLEM - Host db1131.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:54:25] RECOVERY - Host db2130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.14 ms [14:54:25] RECOVERY - Host es2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [14:54:49] RECOVERY - Host db2084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.44 ms [14:54:52] !log kormat@cumin1001 dbctl commit (dc=all): 'db2114: depool for schema change T259831', diff saved to https://phabricator.wikimedia.org/P12632 and previous config saved to /var/cache/conftool/dbconfig/20200917-145451-kormat.json [14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:58] (PS4) Jbond: pki::client: add class to install client pki tools [puppet] - https://gerrit.wikimedia.org/r/628098 (https://phabricator.wikimedia.org/T259117) [14:54:59] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [14:55:05] (PS1) Giuseppe Lavagetto: wikifeeds: use https to connect to restbase directly in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/628115 (https://phabricator.wikimedia.org/T263043) [14:55:27] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:55:31] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [14:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:20] Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, 
serviceops, Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (Pchelolo) One other reason feeds were proxied via restbase was that we didn't request... [14:56:45] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:56:46] (CR) Giuseppe Lavagetto: [C: +2] wikifeeds: use https to connect to restbase directly in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/628115 (https://phabricator.wikimedia.org/T263043) (owner: Giuseppe Lavagetto) [14:58:11] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:17] (CR) Jbond: [C: +2] stunnel: Add new stunnel class and daemon define [puppet] - https://gerrit.wikimedia.org/r/628078 (https://phabricator.wikimedia.org/T259117) (owner: Jbond) [14:58:21] (CR) Jbond: [C: +2] pki::client: add class to install client pki tools [puppet] - https://gerrit.wikimedia.org/r/628098 (https://phabricator.wikimedia.org/T259117) (owner: Jbond) [14:59:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:59:25] (Merged) jenkins-bot: wikifeeds: use https to connect to restbase directly in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/628115 (https://phabricator.wikimedia.org/T263043) (owner: Giuseppe Lavagetto) [14:59:51] PROBLEM - Host mw2273.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:00:20] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:00:21] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[15:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:22] !log deploying extended grants for admin account on sys/p_s at s8@codfw T195578 [15:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:29] T195578: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers) - https://phabricator.wikimedia.org/T195578 [15:02:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12633 and previous config saved to /var/cache/conftool/dbconfig/20200917-150234-marostegui.json [15:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [15:03:05] (PS5) Elukey: Use local puppet host certificates for Hadoop Test's TLS encryption [puppet] - https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) [15:03:59] RECOVERY - Host mw2273.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [15:04:57] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:57] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:05] (PS1) Jbond: pki: enable pki client [puppet] - https://gerrit.wikimedia.org/r/628117 (https://phabricator.wikimedia.org/T259117) [15:10:26] (CR) CRusnov: [C: +1] "With the latest change, LGTM." 
[software/netbox-extras] - https://gerrit.wikimedia.org/r/627899 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [15:11:14] (CR) Jbond: [C: +2] pki: enable pki client [puppet] - https://gerrit.wikimedia.org/r/628117 (https://phabricator.wikimedia.org/T259117) (owner: Jbond) [15:12:38] (CR) Volans: [C: +2] dns: split public zones per DC [software/netbox-extras] - https://gerrit.wikimedia.org/r/627899 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [15:13:01] (CR) Volans: [C: +2] dns: correctly sort IPv6 PTR records [software/netbox-extras] - https://gerrit.wikimedia.org/r/627909 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [15:13:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db2125 T260670', diff saved to https://phabricator.wikimedia.org/P12634 and previous config saved to /var/cache/conftool/dbconfig/20200917-151347-marostegui.json [15:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:55] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [15:15:52] (CR) CRusnov: [C: +1] "looks good" [software/netbox-extras] - https://gerrit.wikimedia.org/r/628061 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [15:16:05] (CR) Volans: [C: +2] dns: do not try to generate PTR for external IPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/628061 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [15:17:02] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] (CR) Urbanecm: [C: -1] "Please add the wiki into db-codfw.php and db-eqiad.php - otherwise, MediaWiki would still communicate with s3 servers. 
We can also remove " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [15:17:58] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25165/" [puppet] - 10https://gerrit.wikimedia.org/r/628084 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [15:18:12] 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Wikifeeds, 10serviceops: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (10Mholloway) [15:18:21] (03CR) 10Cwhite: [C: 03+1] icinga: notify on commands change [puppet] - 10https://gerrit.wikimedia.org/r/628040 (https://phabricator.wikimedia.org/T263027) (owner: 10Filippo Giunchedi) [15:19:18] RECOVERY - Host ps1-c1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.89 ms [15:19:47] 10Operations, 10Product-Infrastructure-Team-Backlog, 10RESTBase, 10Wikifeeds, 10serviceops: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (10Mholloway) I added the RESTBase project tag since feed/featured is currentl... 
[15:20:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db2114: repool at 25% T259831', diff saved to https://phabricator.wikimedia.org/P12635 and previous config saved to /var/cache/conftool/dbconfig/20200917-152019-kormat.json [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:27] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [15:20:39] PROBLEM - ps1-c1-eqiad-infeed-load-tower-A-phase-Y on ps1-c1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:13] RECOVERY - Host ps1-d2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [15:21:31] RECOVERY - Host ps1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [15:21:51] PROBLEM - ps1-c1-eqiad-infeed-load-tower-A-phase-X on ps1-c1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:51] PROBLEM - ps1-c1-eqiad-infeed-load-tower-B-phase-X on ps1-c1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:51] PROBLEM - ps1-c1-eqiad-infeed-load-tower-A-phase-Z on ps1-c1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:23:55] PROBLEM - ps1-d2-eqiad-infeed-load-tower-A-phase-Y on ps1-d2-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:05] PROBLEM - ps1-d1-eqiad-infeed-load-tower-A-phase-X on ps1-d1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:05] 
PROBLEM - ps1-d1-eqiad-infeed-load-tower-A-phase-Y on ps1-d1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:05] PROBLEM - ps1-d1-eqiad-infeed-load-tower-B-phase-X on ps1-d1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:07] PROBLEM - ps1-d1-eqiad-infeed-load-tower-B-phase-Y on ps1-d1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:25:33] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:15] PROBLEM - ps1-c1-eqiad-infeed-load-tower-B-phase-Y on ps1-c1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:28:41] PROBLEM - Host db2074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:34:07] RECOVERY - Host db2074.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.02 ms [15:35:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db2114: repool at 50% T259831', diff saved to https://phabricator.wikimedia.org/P12636 and previous config saved to /var/cache/conftool/dbconfig/20200917-153514-kormat.json [15:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:21] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [15:39:29] RECOVERY - Host db1131.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [15:41:04] c1 netbox updates done [15:44:32] !log kormat@cumin1001 dbctl commit (dc=all): 'db2114: repool at 75% T259831', diff saved to https://phabricator.wikimedia.org/P12637 and previous config saved to /var/cache/conftool/dbconfig/20200917-154431-kormat.json [15:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:38] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [15:47:05] RECOVERY - ps1-d1-eqiad-infeed-load-tower-A-phase-X on ps1-d1-eqiad is OK: SNMP OK - ps1-d1-eqiad-infeed-load-tower-A-phase-X 272 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:05] RECOVERY - ps1-d1-eqiad-infeed-load-tower-A-phase-Y on ps1-d1-eqiad is OK: SNMP OK - ps1-d1-eqiad-infeed-load-tower-A-phase-Y 341 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:07] RECOVERY - ps1-d1-eqiad-infeed-load-tower-B-phase-X on ps1-d1-eqiad is OK: SNMP OK - ps1-d1-eqiad-infeed-load-tower-B-phase-X 273 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:09] RECOVERY - ps1-d1-eqiad-infeed-load-tower-B-phase-Y on ps1-d1-eqiad is OK: SNMP OK - 
ps1-d1-eqiad-infeed-load-tower-B-phase-Y 316 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:11] RECOVERY - ps1-d2-eqiad-infeed-load-tower-B-phase-X on ps1-d2-eqiad is OK: SNMP OK - ps1-d2-eqiad-infeed-load-tower-B-phase-X 385 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:43] RECOVERY - ps1-d2-eqiad-infeed-load-tower-B-phase-Y on ps1-d2-eqiad is OK: SNMP OK - ps1-d2-eqiad-infeed-load-tower-B-phase-Y 321 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:53] RECOVERY - ps1-d2-eqiad-infeed-load-tower-B-phase-Z on ps1-d2-eqiad is OK: SNMP OK - ps1-d2-eqiad-infeed-load-tower-B-phase-Z 342 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:49:11] RECOVERY - ps1-c1-eqiad-infeed-load-tower-A-phase-Y on ps1-c1-eqiad is OK: SNMP OK - ps1-c1-eqiad-infeed-load-tower-A-phase-Y 139 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:21] RECOVERY - ps1-c1-eqiad-infeed-load-tower-A-phase-Z on ps1-c1-eqiad is OK: SNMP OK - ps1-c1-eqiad-infeed-load-tower-A-phase-Z 231 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:15] PROBLEM - Host elastic2054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:37] RECOVERY - ps1-c1-eqiad-infeed-load-tower-B-phase-Y on ps1-c1-eqiad is OK: SNMP OK - ps1-c1-eqiad-infeed-load-tower-B-phase-Y 190 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:56:25] RECOVERY - ps1-c1-eqiad-infeed-load-tower-A-phase-X on ps1-c1-eqiad is OK: SNMP OK - ps1-c1-eqiad-infeed-load-tower-A-phase-X 213 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:56:25] RECOVERY - ps1-c1-eqiad-infeed-load-tower-B-phase-X on ps1-c1-eqiad is OK: SNMP OK - ps1-c1-eqiad-infeed-load-tower-B-phase-X 240 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:56:51] RECOVERY - ps1-d1-eqiad-infeed-load-tower-B-phase-Z on ps1-d1-eqiad is OK: SNMP OK - ps1-d1-eqiad-infeed-load-tower-B-phase-Z 293 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:09] !log kormat@cumin1001 dbctl commit (dc=all): 'db2114: repool at 100% T259831', diff saved to https://phabricator.wikimedia.org/P12638 and previous config saved to /var/cache/conftool/dbconfig/20200917-155708-kormat.json [15:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:15] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [15:57:19] RECOVERY - ps1-d2-eqiad-infeed-load-tower-A-phase-Y on ps1-d2-eqiad is OK: SNMP OK - ps1-d2-eqiad-infeed-load-tower-A-phase-Y 310 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:58:25] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:55] PROBLEM - Host ms-be2056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:09] PROBLEM - Host cp2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:09] PROBLEM - Host cp2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:15] PROBLEM - Host ms-be2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:15] PROBLEM - Host ms-be2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:15] PROBLEM - Host ms-be2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:17] PROBLEM - Host ms-be2050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:43] PROBLEM - Host elastic2053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:55] !log Update rack location on zarcillo for db1131 T262901 [16:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:01] T262901: Physically move db1131 from B5 to C8 - https://phabricator.wikimedia.org/T262901 [16:00:04] jbond42 and 
cdanis: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1600). [16:00:41] PROBLEM - Host thanos-be2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:57] PROBLEM - Host elastic2060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:11] PROBLEM - Host kafka-main2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:49] RECOVERY - Host ps1-d7-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.48 ms [16:03:14] !log Recreate db1131 on tendril T262901 [16:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:11] RECOVERY - Host ms-be2056.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.53 ms [16:04:27] RECOVERY - Host cp2041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.85 ms [16:04:27] RECOVERY - Host cp2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [16:04:33] RECOVERY - Host ms-be2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [16:04:33] RECOVERY - Host ms-be2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [16:04:33] RECOVERY - Host ms-be2039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [16:04:35] RECOVERY - Host ms-be2050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.84 ms [16:05:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change db1131 IP after moving it to a different rack T262901', diff saved to https://phabricator.wikimedia.org/P12639 and previous config saved to /var/cache/conftool/dbconfig/20200917-160540-marostegui.json [16:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:47] T262901: Physically move db1131 from B5 to C8 - https://phabricator.wikimedia.org/T262901 [16:05:58] RECOVERY - Host elastic2054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.79 ms [16:06:00] RECOVERY - Host elastic2053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.75 ms [16:06:08] RECOVERY - Host thanos-be2004.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 31.77 ms [16:06:24] RECOVERY - Host elastic2060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [16:07:14] RECOVERY - Host kafka-main2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.16 ms [16:15:04] !log replacing msw-d8-codfw [16:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:48] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:27] PROBLEM - Host parse2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:19:49] PROBLEM - Host restbase2023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:21] PROBLEM - Host ganeti2018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:31] PROBLEM - Host parse2018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:51] RECOVERY - Host restbase2023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.48 ms [16:21:30] !log Restart wikibugs [16:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:25] (03PS7) 10Bstorm: cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) [16:22:47] (03CR) 10Bstorm: cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:23:01] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10greg) (Not wanting to nit-pick, really, just curious...) Is sending username/email to gravatar by default (how I assume this would work using that local... 
[16:23:13] (03CR) 10Ppchelko: [C: 03+1] api-gateway: allow mwdebug hosts in calico [deployment-charts] - 10https://gerrit.wikimedia.org/r/628127 (https://phabricator.wikimedia.org/T262396) (owner: 10Hnowlan) [16:24:41] RECOVERY - Host parse2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [16:25:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:25:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:25:39] RECOVERY - Host ganeti2018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms [16:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:49] RECOVERY - Host parse2018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [16:27:10] !log ppchelko@deploy1001 Started deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema [16:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:33] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:17] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Ladsgroup) Sorta responding to Greg and hashar too. I agree having proxy is less than optimal. But there's another option as suggested in {T256541} (grav... [16:29:10] Hello, why s1 and s4 have so big "lag values"? https://pastebin.com/5jamFY0c [16:30:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:18] (03CR) 10Jbond: [C: 03+2] cumin: for new wmcs. 
prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:31:34] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 ipmi alert - https://phabricator.wikimedia.org/T263145 (10Andrew) [16:32:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:24] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema (duration: 06m 14s) [16:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:33] !log ppchelko@deploy1001 Started deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema, feed timeout [16:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:00] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema, feed timeout (duration: 07m 26s) [16:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:05] !log ppchelko@deploy1001 Started deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema, feed timeout [16:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:54] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema, feed timeout (duration: 02m 50s) [16:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:23] !log ppchelko@deploy1001 Started deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema, feed timeout [16:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:35] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@6f507e0]: Fix up metrics editors-by-country schema, feed timeout (duration: 01m 
12s) [16:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:01] (03PS1) 10Elukey: profile::hadoop::common: refactor puppet TLS cert deployment [puppet] - 10https://gerrit.wikimedia.org/r/628141 (https://phabricator.wikimedia.org/T253957) [16:46:26] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) >>! In T191183#6471328, @greg wrote: > Is sending username/email to gravatar by default (how I assume this would work using that local proxy) consid... [16:46:39] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Thank you @Trizek-WMF! Really appreciate all your work on this, and your team's. One note for you, maybe to do with translation... [16:48:09] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [16:49:08] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) Just for the context the reason I declined this again is because I have seen on IRC a notice about disabling gravatar for OTRS ( T187984#6465324... [16:53:52] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10greg) Agreed with @hashar on the way forward for this request. [16:56:57] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/627964 (https://phabricator.wikimedia.org/T263149) (owner: 10DannyS712) [16:57:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [16:57:26] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% #page [16:57:41] 👀 [16:57:45] cdanis: related to some WIP? [16:57:49] none I know of [16:57:49] * volans here [16:57:55] XioNoX: ? [16:58:12] looking [16:59:01] I see ~5Gbit of usage but not 8 [16:59:57] cdanis: https://librenms.wikimedia.org/graphs/to=1600361700/id=16333/type=port_bits/from=1600275300/ [17:00:04] halfak and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1700). Please do the needful. [17:00:16] ah [17:00:19] it has been oscillating around the 8G [17:00:28] so it's "normal" and no emergency [17:00:29] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary inbound port utilisation over 80% #page (cr2-eqdfw.wikimedia.org) // Primary outbound port utilisation over 80% #page (cr1-codfw.wikimedia.org) https://bit.ly/wmf-librenms [17:00:38] but worrying [17:00:39] er [17:01:06] here if you need more hands [17:01:16] can someone ACK the alert [17:01:16] o/ [17:01:26] VO is asking me to log-in -_-' [17:01:34] just cked [17:01:39] i had it open [17:01:43] done also [17:01:57] rzl beat meeee [17:01:59] shows him [17:02:03] XioNoX: 120003 is down for 2h [17:02:36] so, codfw->eqiad traffic might also be transiting eqdfw? [17:02:44] and we're doing 6G/s of outbound peering in eqdfw [17:03:12] yeah peering + transit [17:03:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page [17:03:22] * jbond42 here [17:03:26] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% #page [17:03:40] ! 
log restarting ps1-d8-codfw [17:03:43] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms [17:03:56] !log restarting ps1-d8-codfw [17:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:08] and https://librenms.wikimedia.org/device/device=139/tab=port/port=16552/ is bored because yay load balancing, VRRP master must be on cr1 [17:04:20] papaul: can you hold your work for a bit? [17:04:38] about the codfw network event it's fine [17:04:51] (just to let these alarms to clear) [17:04:59] elukey: they did [17:05:09] we might want to repool eqiad just in case [17:05:20] cdanis: ack I thought some were ongoing, good [17:05:21] yeah... [17:05:33] rzl: do you have thoughts or feelings re: repooling eqiad? [17:06:03] elukey: ok [17:06:04] XioNoX is the only person I'd have checked with first :) if we need the capacity, go for it [17:06:22] want me to put it in? [17:06:38] please [17:06:38] if you can prepare it for now, please [17:06:44] ack, preparing [17:07:05] trying to figure out what changed, but it might be normal traffic fluctuation that brought us right above the 80% limit [17:07:07] wiki_willy: fyi please consider all cp* nodes in eqiad to be live again, I'll update you if that changes [17:07:35] papaul: my bad the issue seems over, please go ahead [17:08:29] from https://librenms.wikimedia.org/bill/bill_id=15/ it looks like we're sending a bit more traffic out than the other days [17:08:35] (03PS1) 10RLazarus: Revert "Depool eqiad for 2020-2021 DC switchover" [dns] - 10https://gerrit.wikimedia.org/r/628143 [17:08:44] but nothing alarming [17:09:08] (03PS1) 10Volans: sre.dns.netbox: allow to run in DRY-RUN mode [cookbooks] - 10https://gerrit.wikimedia.org/r/628144 (https://phabricator.wikimedia.org/T258729) [17:09:08] ^ repool is ready, waiting [17:10:01] rzl: please go ahead [17:10:04] (03CR) 10CDanis: [C: 03+1] Revert "Depool eqiad for 2020-2021 DC 
switchover" [dns] - 10https://gerrit.wikimedia.org/r/628143 (owner: 10RLazarus) [17:10:10] XioNoX: ack [17:10:17] (03CR) 10RLazarus: [C: 03+2] Revert "Depool eqiad for 2020-2021 DC switchover" [dns] - 10https://gerrit.wikimedia.org/r/628143 (owner: 10RLazarus) [17:10:20] elukey: ok [17:10:20] rzl: You have the authority to deploy this change. [17:10:39] (03PS2) 10Bartosz Dziewoński: DiscussionTools: Fix task comments for second round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627288 (owner: 10Esanders) [17:11:32] good thing I can't find my running headphones, was about to head out [17:11:34] (03PS2) 10Elukey: profile::hadoop::common: refactor puppet TLS cert deployment [puppet] - 10https://gerrit.wikimedia.org/r/628141 (https://phabricator.wikimedia.org/T253957) [17:11:52] authdns-update is running [17:11:54] (03CR) 10Bartosz Dziewoński: [C: 03+1] Enable DiscussionTools beta on jawiki & viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627289 (https://phabricator.wikimedia.org/T261654) (owner: 10Esanders) [17:12:01] XioNoX: this is the kind of thing we could also work around with some traffic engineering in codfw/eqdfw, but we haven't bothered to do that because it's usually so cold, right? 
[17:12:28] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [17:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:42] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:12:44] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [17:12:46] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [17:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:16] (03PS1) 10Razzi: Add MaxMind data files to Matomo /misc directory [puppet] - 10https://gerrit.wikimedia.org/r/628146 (https://phabricator.wikimedia.org/T213741) [17:13:44] yeah the cold caches won't be great [17:13:54] but they'll figure themselves out fairly quickly, for the most part [17:14:30] (03CR) 10jerkins-bot: [V: 04-1] Add MaxMind data files to Matomo /misc directory [puppet] - 10https://gerrit.wikimedia.org/r/628146 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [17:14:50] note that there is no urgency, so if there is a way to warm the cache go for it [17:14:58] not a great one [17:15:08] plus it's already self-warming :) [17:15:15] cdanis: yeah, there are a few options: 1/ terminate the links on a switch stack and put them in a LACP bundle, 2/ do traffic engineering, but here it means de-prefing peering over transit, 3/ we can do better load balancing on those links (complicated) [17:15:18] yeah, it gets ramped up over 5 minutes [17:15:25] XioNoX: ack [17:15:33] just saying, expect a big spike in request rate to applayer services [17:15:43] nod [17:15:46] it will tail off quickly, but it will also take many hours to fully normalize [17:16:24] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:16:44] I'm getting an "SSO" error trying to login on VO from my phone [17:17:14] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25169/" [puppet] - 10https://gerrit.wikimedia.org/r/628141 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [17:17:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [17:17:22] XioNoX: that happened to me before, I had to force killing the app on the phone and then it worked [17:17:26] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% #page [17:17:31] er [17:17:48] (03PS2) 10Razzi: Add MaxMind data files to Matomo /misc directory [puppet] - 10https://gerrit.wikimedia.org/r/628146 (https://phabricator.wikimedia.org/T213741) [17:18:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page [17:18:18] ok [17:18:25] odd to see them 1 minute apart [17:18:26] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% #page [17:18:36] (03PS1) 10Andrew Bogott: Remove refs to cloudvirt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/628147 (https://phabricator.wikimedia.org/T263151) [17:18:51] (03PS1) 10Nskaggs: Revert "toolforge: Temp handling for tools.wmflabs.org/wpcleaner" [puppet] - 10https://gerrit.wikimedia.org/r/628148 (https://phabricator.wikimedia.org/T258813) [17:18:54] I'll open a task to not forget about the issue [17:19:06] thanks! 
[17:19:13] (03PS1) 10Volans: sre.hosts.decommission: allow to run in dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/628149 [17:19:20] I will keep an eye on peering/transit/transport traffic [17:20:04] oh I suppose the SAL might have been a good place to mention this [17:20:13] !log repooled eqiad at 17:11 [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] Hi, is it OK deploy a new version of mobileapps ? [17:21:52] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 53.04 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:22:13] ^ expected, traffic shifted to eqiad [17:22:47] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/627602 (owner: 10Bstorm) [17:22:51] cdanis, XioNoX: see nemo-yiannis's question [17:23:04] yep, all fine [17:23:10] nemo-yiannis: yep! [17:23:17] network related there are 0 issues if the repooling goes well [17:23:24] (03CR) 10Bstorm: [C: 03+2] icinga: permissions add both cases for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/627602 (owner: 10Bstorm) [17:23:26] thanks cdanis! 
[17:23:28] (03PS1) 10Andrew Bogott: Remove refs to cloudvirt100[1-9] [dns] - 10https://gerrit.wikimedia.org/r/628150 (https://phabricator.wikimedia.org/T263151) [17:25:32] (03PS1) 10Elukey: profile::hadoop::common: add g+r to the puppet cert exposed file [puppet] - 10https://gerrit.wikimedia.org/r/628151 [17:26:07] (03PS1) 10Dzahn: base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 [17:26:10] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [17:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:12] (03CR) 10jerkins-bot: [V: 04-1] base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [17:27:17] (03CR) 10Elukey: [C: 03+2] profile::hadoop::common: add g+r to the puppet cert exposed file [puppet] - 10https://gerrit.wikimedia.org/r/628151 (owner: 10Elukey) [17:29:04] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) Filed {T263161} about having a Gravatar proxy. 
[17:30:14] (03CR) 10CRusnov: [C: 03+1] "looks good as discussed" [cookbooks] - 10https://gerrit.wikimedia.org/r/628149 (owner: 10Volans) [17:30:56] (03CR) 10Dzahn: [C: 03+2] Extend Cumin alias for mw-canary to also include the canary API servers [puppet] - 10https://gerrit.wikimedia.org/r/627327 (owner: 10Muehlenhoff) [17:31:01] (03PS2) 10Dzahn: Extend Cumin alias for mw-canary to also include the canary API servers [puppet] - 10https://gerrit.wikimedia.org/r/627327 (owner: 10Muehlenhoff) [17:31:13] (03CR) 10Razzi: "Puppet catalog compiler output: https://puppet-compiler.wmflabs.org/compiler1001/25170/" [puppet] - 10https://gerrit.wikimedia.org/r/628146 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [17:31:34] (03CR) 10Elukey: [C: 03+1] Add MaxMind data files to Matomo /misc directory [puppet] - 10https://gerrit.wikimedia.org/r/628146 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [17:32:13] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [17:32:26] but [17:32:41] that's the eqiad/codfw link [17:32:49] any idea why it could saturate? 
[17:32:57] ZYO/120003 is down again [17:32:59] https://librenms.wikimedia.org/device/device=1/tab=port/port=6815/ [17:33:12] it's not saturating but peaking about the 80% [17:33:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page [17:33:52] I acked them in librenms [17:33:53] uff [17:33:55] thanks [17:34:00] the eqiad/codfw link is because eqiad is depooled on the inside [17:34:01] (03PS2) 10Volans: sre.hosts.decommission: allow to run in dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/628149 [17:34:13] but a bunch of traffic is now arriving at the eqiad edge and then the misses reach out to codfw [17:34:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page got acknowledged [17:34:26] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged [17:34:29] yeah, what bblack said [17:34:45] it should calm down as the caches warm, though [17:34:45] (03CR) 10Andrew Bogott: [C: 03+2] Remove refs to cloudvirt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/628147 (https://phabricator.wikimedia.org/T263151) (owner: 10Andrew Bogott) [17:34:49] so it should calm down when the caches are warm? [17:34:51] cool [17:34:54] we've probably already seen the peak [17:34:56] (03CR) 10Andrew Bogott: [C: 03+2] Remove refs to cloudvirt100[1-9] [dns] - 10https://gerrit.wikimedia.org/r/628150 (https://phabricator.wikimedia.org/T263151) (owner: 10Andrew Bogott) [17:35:45] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:36:57] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:37:07] (03CR) 10CRusnov: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/628144 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:37:29] 10Operations, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt100[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T263151 (10Andrew) a:05Andrew→03Cmjohnson [17:38:13] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: allow to run in DRY-RUN mode [cookbooks] - 10https://gerrit.wikimedia.org/r/628144 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:39:16] well, maybe we've seen the peak, waiting for a few more samples [17:39:50] (03Merged) 10jenkins-bot: sre.dns.netbox: allow to run in DRY-RUN mode [cookbooks] - 10https://gerrit.wikimedia.org/r/628144 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [17:40:02] yeah, 8.4Gbit [17:40:18] it does look like we're prob at the inflection point tho [17:40:23] 🤞 [17:41:50] max q :) [17:42:12] heh [17:43:03] good thing we don't have 8Gbps links [17:43:41] good thing we use 5 minute sampling too, because the instantaneous peaks are probably closer to saturation and that's why it has a hard time averaging about 8.4 :) [17:43:53] s/about/above/ [17:44:01] (03PS1) 10Dzahn: cumin: add alias for just canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/628153 [17:46:51] er, true [17:47:45] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/628158 [17:48:59] (03CR) 10Dzahn: [C: 03+2] "@Effie adding alias for mw-app-canary here as you suggested" [puppet] - 10https://gerrit.wikimedia.org/r/628153 (owner: 10Dzahn) [17:49:08] (03CR) 10Razzi: [C: 03+2] Add MaxMind data files to Matomo /misc directory [puppet] - 10https://gerrit.wikimedia.org/r/628146 
(https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [17:49:31] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/627327 (owner: 10Muehlenhoff) [17:49:40] at this point in the curve we're somewhere around double the normal miss volume, and slowly getting better [17:49:53] (03CR) 10CRusnov: [C: 03+1] "latest change looks good." [cookbooks] - 10https://gerrit.wikimedia.org/r/628149 (owner: 10Volans) [17:50:09] (03PS15) 10Dzahn: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [17:51:16] the old "varnish-caching" dashboard provides some insight [17:51:18] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=15m&from=now-1h&to=now&var-cluster=cache_text&var-site=codfw&var-site=eqiad&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [17:51:22] heh I am doing too many things at once :) trying to figure out which of many of the various TCP retransmit metrics we care about [17:51:31] because that's probably our best near-realtime proxy for saturation [17:51:34] e.g. you can combine eqiad+codfw there, and see the impact on their combined hit/miss rates [17:53:02] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/628158 [17:53:05] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:53:14] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: allow to run in dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/628149 (owner: 10Volans) [17:53:37] (03PS3) 10Herron: prometheus: enable rsyncd on pop hosts [puppet] - 10https://gerrit.wikimedia.org/r/628158 (https://phabricator.wikimedia.org/T243057) [17:53:43] (03Merged) 10jenkins-bot: sre.hosts.decommission: allow to run in dry-run [cookbooks] - 10https://gerrit.wikimedia.org/r/628149 (owner: 10Volans) [17:54:05] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) Interesting disparity between RB feed and wikifeeds feed is that in RB we e... [17:54:08] fwiw there *hasn't* been much impact on appserver-reported retransmits [17:55:42] (03PS1) 10Volans: sre.dns.netbox: fix misleading message [cookbooks] - 10https://gerrit.wikimedia.org/r/628161 [17:56:37] cdanis: https://grafana.wikimedia.org/d/000000366/network-performances-global?viewPanel=18&orgId=1 ? [17:57:10] XioNoX: yeah, saw that, but then wanted raw numbers, and then saw there were like 20 different metrics [17:57:28] should've just trusted it when it told me there wasn't really a problem ;) [17:57:29] (03CR) 10CRusnov: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/628161 (owner: 10Volans) [17:58:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page [17:58:30] nice [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1800). 
[18:00:04] DannyS712 and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] Here [18:00:21] I can deploy today! [18:00:30] My patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/627964 [18:00:30] hi [18:01:47] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: fix misleading message [cookbooks] - 10https://gerrit.wikimedia.org/r/628161 (owner: 10Volans) [18:02:16] DannyS712: I'd rather consult this patch with others, this is not one of the "freely granted" permissions [18:02:56] (03Merged) 10jenkins-bot: sre.dns.netbox: fix misleading message [cookbooks] - 10https://gerrit.wikimedia.org/r/628161 (owner: 10Volans) [18:03:05] so, I can't deploy this today DannyS712 [18:03:11] past two librenms datapoints have been consistently downwards [18:03:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page [18:03:18] (03PS3) 10Urbanecm: DiscussionTools: Fix task comments for second round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627288 (owner: 10Esanders) [18:03:51] Urbanecm can I ask why? 
Stewards don't have access to otrswiki so currently no one can delete big pages there [18:03:53] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Fix task comments for second round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627288 (owner: 10Esanders) [18:04:26] (03PS2) 10Urbanecm: Enable DiscussionTools beta on jawiki & viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627289 (https://phabricator.wikimedia.org/T261654) (owner: 10Esanders) [18:04:28] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools beta on jawiki & viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627289 (https://phabricator.wikimedia.org/T261654) (owner: 10Esanders) [18:04:33] DannyS712: as I said, I'd rather consult this patch with performance people, as bigdelete can cause a lot of issues. [18:04:47] (03Merged) 10jenkins-bot: DiscussionTools: Fix task comments for second round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627288 (owner: 10Esanders) [18:04:48] (03Merged) 10jenkins-bot: Enable DiscussionTools beta on jawiki & viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627289 (https://phabricator.wikimedia.org/T261654) (owner: 10Esanders) [18:04:55] DannyS712: I'm not asking this will never go out, I'm just saying I'd like to confirm this is sane [18:05:08] okay [18:05:16] *I'm not saying [18:06:44] !log Move /srv/mediawiki-stagging/grep (owned by tstarling) to /home/urbanecm to make working directory clean (cc TimStarling) [18:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:13] MatmaRex: hello! Both patches are now at mwdebug2001, can you test, please? 
[18:07:27] looking [18:07:33] (I mean, the second one, the first is obvious nop) [18:08:38] (03PS1) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) [18:08:40] (03PS1) 10Razzi: profile::piwik::instance: actually copy Maxmind data files [puppet] - 10https://gerrit.wikimedia.org/r/628164 (https://phabricator.wikimedia.org/T213741) [18:08:44] Urbanecm: looks good [18:08:47] thanks, syncing [18:09:17] (03PS2) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) [18:09:24] (03CR) 10jerkins-bot: [V: 04-1] profile::piwik::instance: actually copy Maxmind data files [puppet] - 10https://gerrit.wikimedia.org/r/628164 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [18:09:51] Urbanecm when you're done can you note your concerns on T263149 so its clear why it wasn't deployed? [18:09:51] T263149: Add bigdelete to OTRS Wiki bureaucrats - https://phabricator.wikimedia.org/T263149 [18:10:00] sure, I'll follow up there [18:10:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 40591d3dfdc2fc360cb060770677a48e2a53362c: Enable DiscussionTools beta on jawiki & viwiki (T261654; T262109) (duration: 00m 56s) [18:10:17] (03PS2) 10Razzi: profile::piwik::instance: actually copy Maxmind data files [puppet] - 10https://gerrit.wikimedia.org/r/628164 (https://phabricator.wikimedia.org/T213741) [18:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:23] T262109: Enable Reply Tool on vi.wiki - https://phabricator.wikimedia.org/T262109 [18:10:23] T261654: Enable DiscussionTools on jawiki - https://phabricator.wikimedia.org/T261654 [18:10:44] MatmaRex: unless you have anything else, we're done :). 
[18:10:53] thanks [18:11:19] no problem [18:11:23] !log Morning B&C done [18:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:23] (03PS1) 10Jgiannelos: Bump mobileapps to version 2020-09-17-151051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628165 [18:12:59] (03PS3) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) [18:13:06] (03CR) 10Urbanecm: [C: 04-2] "This should definitely be signed off by someone (maybe perfteam?) before getting deployed, as deleting big pages is capable to cause real " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627964 (https://phabricator.wikimedia.org/T263149) (owner: 10DannyS712) [18:13:12] (03CR) 10Razzi: "Catalog compile: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/25175/" [puppet] - 10https://gerrit.wikimedia.org/r/628164 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [18:14:28] (03CR) 10Elukey: [C: 03+1] profile::piwik::instance: actually copy Maxmind data files [puppet] - 10https://gerrit.wikimedia.org/r/628164 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [18:14:34] (03CR) 10Razzi: [C: 03+2] profile::piwik::instance: actually copy Maxmind data files [puppet] - 10https://gerrit.wikimedia.org/r/628164 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [18:16:40] (03CR) 10Mholloway: [C: 03+1] Bump mobileapps to version 2020-09-17-151051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628165 (owner: 10Jgiannelos) [18:17:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10Jgreen) [18:19:04] (03PS1) 10Dzahn: kafka: remove non-existing changeprop admin group to fix puppet run [puppet] - 10https://gerrit.wikimedia.org/r/628167 
[18:20:09] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (10RobH) So this shows an alarm in software, but Chris checked and is green on the actual PSU. I sent a racreset, and it seems to have fixed things: > 23 $> ssh root@dbproxy1020.mgmt.eqia... [18:20:24] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (10RobH) [18:20:29] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (10RobH) 05Open→03Resolved [18:20:31] (03CR) 10Dzahn: [C: 03+2] kafka: remove non-existing changeprop admin group to fix puppet run [puppet] - 10https://gerrit.wikimedia.org/r/628167 (owner: 10Dzahn) [18:21:37] RECOVERY - IPMI Sensor Status on dbproxy1020 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:22:37] (03CR) 10Mholloway: [C: 03+2] Bump mobileapps to version 2020-09-17-151051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628165 (owner: 10Jgiannelos) [18:22:39] (03PS1) 10Jcrespo: cli: Make /etc/wmfbackups the config dir for the main backup scripts [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/628168 (https://phabricator.wikimedia.org/T138562) [18:24:01] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (10wiki_willy) Thanks @RobH for looking and fixing this. 
Thanks, Willy [18:25:18] (03Merged) 10jenkins-bot: Bump mobileapps to version 2020-09-17-151051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628165 (owner: 10Jgiannelos) [18:26:15] (03PS1) 10Dzahn: kafka: remove cpjobqueue-admin group as well [puppet] - 10https://gerrit.wikimedia.org/r/628171 [18:26:26] (03CR) 10jerkins-bot: [V: 04-1] kafka: remove cpjobqueue-admin group as well [puppet] - 10https://gerrit.wikimedia.org/r/628171 (owner: 10Dzahn) [18:27:55] (03PS2) 10Dzahn: kafka: remove cpjobqueue-admin group as well [puppet] - 10https://gerrit.wikimedia.org/r/628171 [18:28:10] (03PS3) 10Dzahn: kafka: remove cpjobqueue-admin group as well [puppet] - 10https://gerrit.wikimedia.org/r/628171 (https://phabricator.wikimedia.org/T220399) [18:28:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [18:28:26] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page [18:29:01] I re-acked them in librenms [18:29:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page got acknowledged [18:29:26] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged [18:29:28] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: dbproxy1020 PS disconnected - https://phabricator.wikimedia.org/T262998 (10Marostegui) Thank you guys! [18:29:33] !log jgiannelos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[18:29:35] maybe we should have picked another word other than # p a g e for the special signifier for the icinga check [18:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:40] (03CR) 10Dzahn: [C: 03+2] kafka: remove cpjobqueue-admin group as well [puppet] - 10https://gerrit.wikimedia.org/r/628171 (https://phabricator.wikimedia.org/T220399) (owner: 10Dzahn) [18:30:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged [18:30:34] (03PS1) 10Jcrespo: [WIP] Use the shared list of sections to validate backup checks [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/628172 (https://phabricator.wikimedia.org/T138562) [18:31:11] also it has been at 8G for 1h now https://librenms.wikimedia.org/graphs/to=1600367100/id=6815/type=port_bits/from=1600345500/ [18:33:33] (03CR) 10Dzahn: "the 2 admin groups removed here were still applied on kafka-main hosts. this broke puppet runs on these hosts because the groups were now " [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:34:33] cache hitrate has been increasing, but so did rps [18:35:26] this is actually funny XioNoX rzl -- we left eqiad depooled to test being without its edge peering/transit capacity, but, maybe the actual limiting factor is the 10G between eqiad and codfw! [18:35:34] yeah but the rps doesn't go up much more than this in the usual daily cycle [18:36:03] (for the cores sites anyways) [18:36:16] and yeah it'd be nice to have a bigger interconnect [18:36:23] cdanis: was going to say "what's the probability that 1/3 of them are down" but it's not that low... [18:36:52] XioNoX: eheheh indeed... and we only use 10G max, right? there's no ECMP between sites [18:36:53] or able to load balance between the two [18:36:55] or a quick flip-switch to l4-hash connections over the avail links? 
[18:37:04] lol [18:37:26] without some kind of l3/l4 hashing, if it was random packets, it would make a big mess [18:37:31] (03CR) 10Ppchelko: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:37:36] bblack: Juniper will do that [18:37:39] (with re-ordering of heavy flows' packet sequences from latency diffs) [18:37:41] in our current setup it's not that easy, even if the 2 links have the same ospf metric, where the VRRP master is will decide on which link traffic goes [18:38:11] yeah [18:38:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page [18:38:26] eg if VRRP master of private1-a-eqiad is on cr1, traffic going out of that vlan to codfw will use the link connected to cr1 [18:38:26] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page [18:38:38] I really need to go eat, but, at some point later I'm curious to discuss how long it's been since we did some provisioning planning here, and what possible future steps are (even if they are just "we should buy 40G waves instead of 10G waves" or something, I've no idea the cost difference) [18:38:46] also feel free to keep talking and I'll catch up [18:39:26] !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:39:27] !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[18:39:31] well under "normal" circumstances we don't really face this, and so it has seemed like a lot of expense to cover the occasional corner case of big transfers x-dc, etc [18:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:44] we had those discussions when it was time for CapEx/Opex planning, before June [18:39:46] but I think the scenario today is a pretty good argument for bumping it up [18:40:06] bblack: the more 10G hosts we have in each DC, the more risk we have of saturating the 10G link [18:40:19] yeah [18:40:39] just the eqiad edge caches, that's 160Gbps of theoretical host bandwidth firing at the codfw application layers. [18:40:45] there are a few options, like getting 40G links or aggregating 10G links [18:41:00] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) @hnowlan @akosiaris https://gerrit.wikimedia.org/r/603534 deleted 2 admin groups , changeprop-admin and cpjobqueue-admin. But... [18:41:08] or doing load balancing between the 2 primary links we have [18:41:30] IIRC the 40g option was $$$ [18:41:37] yep exactly [18:47:00] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) Cross-posting from gerrit https://gerrit.wikimedia.org/r/c/operations/puppet/+/603534/4#message-a15aeb97a3211f6d70133fdadc... 
[18:47:14] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page [18:47:25] that's getting annoying [18:47:29] I'm disabling the check for now [18:48:11] done [18:50:17] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) ACK, that means either @hnowlan's change has to be partially reverted to recreate these groups or we need to make new admin g... [18:52:06] (03CR) 10Dzahn: "looks like we have to either partially revert this or create new admin groups: https://phabricator.wikimedia.org/T220399#6472047" [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [18:54:11] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:54:12] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [18:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:33] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) > Should it just be something like "kafka-users" maybe? Sounds good. However, thinking more about it, mobrovac has left t... 
[18:58:28] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) Here is the list of hosts that have kafkacat installed: https://debmonitor.wikimedia.org/packages/kafkacat [19:00:03] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) Thank you. I can live with that, I have access to a number of places. Sorry I didn't think of a workaround like that. In c... [19:00:04] liw and brennen: (Dis)respected human, time to deploy Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T1900). Please do the needful. [19:01:18] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Dzahn) Ok, sounds good. Not uploading the patch to create a new group then. But if you need it just re-request as you said. 
[19:02:34] !log [urbanecm@mwmaint2001 ~]$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=wikidatawiki --logwiki=metawiki 'Filomena ciavarella' 'Filomena Ciavarella' #T262657 [19:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:38] T262657: Global rename "Filomena ciavarella" to "Filomena Ciavarella" is stuck - https://phabricator.wikimedia.org/T262657 [19:02:54] (03CR) 10Dzahn: "sorted out in https://phabricator.wikimedia.org/T220399#6472080" [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [19:04:19] 10Operations, 10observability, 10serviceops, 10Sustainability (Incident Followup): add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (10jijiki) [19:04:23] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [19:05:21] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki) [19:05:24] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [19:08:19] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [19:09:04] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10jijiki) [19:09:09] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) 
[19:09:33] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [19:13:55] 10Operations, 10serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10jijiki) [19:13:59] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [19:25:11] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [19:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:57] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1001/25174/" [puppet] - 10https://gerrit.wikimedia.org/r/628158 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [19:30:25] (03CR) 10Herron: [C: 03+2] prometheus: enable rsyncd on pop hosts [puppet] - 10https://gerrit.wikimedia.org/r/628158 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [19:33:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:41] (03PS1) 10Andrew Bogott: Remove refs to cloudvirt1015 [puppet] - 10https://gerrit.wikimedia.org/r/628182 (https://phabricator.wikimedia.org/T260840) [19:37:36] (03PS1) 10Andrew Bogott: Removed references to cloudvirt1015 [dns] - 10https://gerrit.wikimedia.org/r/628183 (https://phabricator.wikimedia.org/T260840) [19:38:46] (03CR) 10Andrew Bogott: [C: 03+2] Remove refs to cloudvirt1015 [puppet] - 10https://gerrit.wikimedia.org/r/628182 (https://phabricator.wikimedia.org/T260840) (owner: 10Andrew Bogott) [19:38:58] (03CR) 10Andrew Bogott: [C: 03+2] Removed references to cloudvirt1015 [dns] - 10https://gerrit.wikimedia.org/r/628183 (https://phabricator.wikimedia.org/T260840) (owner: 10Andrew Bogott) [19:40:15] 10Operations, 
ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudvirt1015.eqiad.wmnet - https://phabricator.wikimedia.org/T260840 (Andrew) a:Andrew→Cmjohnson
[19:43:58] (CR) Andrew Bogott: [C: +2] "I don't know what this is used for, if anything :(" [puppet] - https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[19:44:09] (CR) Andrew Bogott: [C: +2] tools-clush-generator: use eqiad1.wikimedia.cloud [puppet] - https://gerrit.wikimedia.org/r/626464 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[19:44:23] (PS3) Andrew Bogott: tools-clush-generator: use eqiad1.wikimedia.cloud [puppet] - https://gerrit.wikimedia.org/r/626464 (https://phabricator.wikimedia.org/T260614)
[19:45:23] (CR) RLazarus: [C: +2] gitlab-test: pam access for ssh for git user [puppet] - https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516) (owner: Thcipriani)
[19:45:38] \o/ thanks rzl
[19:45:41] ^ brennen FYI
[19:46:13] andrewbogott: okay to merge yours?
[19:46:22] yes please
[19:46:48] ✅
[19:47:25] thcipriani: sorry I didn't follow up on getting that merged for you. Next time you have a wmcs related puppet patch, feel free to poke in -cloud-admin.
[19:47:50] bd808: no worries and thanks, will do :)
[19:49:02] (PS4) Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614)
[19:49:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:50:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:50:33] (PS5) Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614)
[19:50:44] thcipriani: nice.
[19:51:58] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[19:52:03] (CR) Andrew Bogott: [C: +2] toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[19:53:02] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[19:54:49] (PS4) Andrew Bogott: clush toolforge_canary_list: update! [puppet] - https://gerrit.wikimedia.org/r/626456
[19:56:06] (CR) Andrew Bogott: [C: +2] clush toolforge_canary_list: update! [puppet] - https://gerrit.wikimedia.org/r/626456 (owner: Andrew Bogott)
[19:57:03] (Abandoned) Andrew Bogott: toolschecker: use .eqiad1.wikimedia.cloud [puppet] - https://gerrit.wikimedia.org/r/626468 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[19:59:59] (CR) BryanDavis: [C: +1] Revert "toolforge: Temp handling for tools.wmflabs.org/wpcleaner" [puppet] - https://gerrit.wikimedia.org/r/628148 (https://phabricator.wikimedia.org/T258813) (owner: Nskaggs)
[20:11:17] MatmaRex ugggg
[20:14:22] (PS1) Urbanecm: Fix APCOND_FR_NEVERBLOCKED handling (part 3) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628208 (https://phabricator.wikimedia.org/T262970)
[20:14:39] Operations, Wikimedia-Mailing-lists: Disabling the Daily-article-hy mailing list - https://phabricator.wikimedia.org/T263105 (RLazarus) Open→Resolved p:Triage→Medium a:RLazarus Done! ` rzl@lists1001:~$ sudo disable_list daily-article-hy /var/lib/mailman/data/heldmsg-daily-article-hy-...
[20:14:59] Urbanecm I usually wait until the master patch merges to cherry pick so that it references the commit it was cherry picked from
[20:16:10] DannyS712: didn't know that, thanks!
[20:16:56] I can update them with the commit reference once it merges
[20:17:01] would be nice!
[20:18:34] (PS1) Andrew Bogott: cloud-vps nova fullstack test: use flavor g2.cores1.ram2.disk20 [puppet] - https://gerrit.wikimedia.org/r/628192
[20:19:33] (CR) Andrew Bogott: [C: +2] cloud-vps nova fullstack test: use flavor g2.cores1.ram2.disk20 [puppet] - https://gerrit.wikimedia.org/r/628192 (owner: Andrew Bogott)
[20:20:31] (PS3) Andrew Bogott: base::remote_syslog: use .wikimedia.cloud naming for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/626467 (https://phabricator.wikimedia.org/T260614)
[20:21:48] (CR) Andrew Bogott: [C: +2] base::remote_syslog: use .wikimedia.cloud naming for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/626467 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[20:28:04] (PS2) DannyS712: Fix APCOND_FR_NEVERBLOCKED handling (part 3) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628208 (https://phabricator.wikimedia.org/T262970) (owner: Urbanecm)
[20:38:46] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:17] DannyS712: heh, here we go again
[20:40:35] DannyS712: Urbanecm: oh, are you deploying it right now? (i was afk for a bit)
[20:40:42] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:40:46] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:59] DannyS712: for now, I just backported, but I guess I can deploy that too
[20:41:26] (CR) Urbanecm: [C: +2] Fix APCOND_FR_NEVERBLOCKED handling (part 3) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628208 (https://phabricator.wikimedia.org/T262970) (owner: Urbanecm)
[20:41:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the un
[20:41:40] 03 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:41:51] MatmaRex: ^
[20:41:56] (above icinga)
[20:42:14] mhm
[20:43:07] hopefully this fixes it for good :-)
[20:43:36] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:45:08] (Merged) jenkins-bot: Fix APCOND_FR_NEVERBLOCKED handling (part 3) [extensions/FlaggedRevs] (wmf/1.36.0-wmf.9) - https://gerrit.wikimedia.org/r/628208 (https://phabricator.wikimedia.org/T262970) (owner: Urbanecm)
[20:46:00] PROBLEM - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster=prometheus instance=prometheus5001 job=thanos-sidecar prometheus=ops site=eqsin https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[20:46:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[20:47:24] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.9/extensions/FlaggedRevs/backend/FlaggedRevsHooks.php: 19b9b9877ea3f8ffa6626108941891c2454348de: Fix APCOND_FR_NEVERBLOCKED handling (part 3; T262970) (duration: 00m 57s)
[20:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:29] T262970: FlaggedRevs doesn't check the 'neverBlocked' / APCOND_FR_NEVERBLOCKED option when autopromoting - https://phabricator.wikimedia.org/T262970
[20:47:30] MatmaRex: DannyS712: there we go :)
[20:48:10] thanks!
[20:48:15] happy to help!
[20:49:06] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[20:54:49] (PS1) BryanDavis: Add `--backend=gridengine` to restart command [software/tools-manifest] - https://gerrit.wikimedia.org/r/628194 (https://phabricator.wikimedia.org/T263190)
[20:56:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:57:17] (CR) BryanDavis: "Tested as a live hack on tools-sgecron-01 before making a patch and submitting." [software/tools-manifest] - https://gerrit.wikimedia.org/r/628194 (https://phabricator.wikimedia.org/T263190) (owner: BryanDavis)
[20:58:45] Operations, SRE-Access-Requests: Allow Nicholas Skaggs to issue icinga commands - https://phabricator.wikimedia.org/T263191 (nskaggs)
[20:59:43] (CR) Bstorm: [C: +1] Add `--backend=gridengine` to restart command [software/tools-manifest] - https://gerrit.wikimedia.org/r/628194 (https://phabricator.wikimedia.org/T263190) (owner: BryanDavis)
[20:59:58] (CR) BryanDavis: [C: +2] Add `--backend=gridengine` to restart command [software/tools-manifest] - https://gerrit.wikimedia.org/r/628194 (https://phabricator.wikimedia.org/T263190) (owner: BryanDavis)
[21:00:42] (Merged) jenkins-bot: Add `--backend=gridengine` to restart command [software/tools-manifest] - https://gerrit.wikimedia.org/r/628194 (https://phabricator.wikimedia.org/T263190) (owner: BryanDavis)
[21:00:43] (PS1) Nskaggs: icinga: Let Nskaggs issues commands on all hosts and services [puppet] - https://gerrit.wikimedia.org/r/628196 (https://phabricator.wikimedia.org/T263191)
[21:19:00] (PS1) BryanDavis: webservice: restore setting backend via service.manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/628200 (https://phabricator.wikimedia.org/T263190)
[21:19:39] (CR) jerkins-bot: [V: -1] webservice: restore setting backend via service.manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/628200 (https://phabricator.wikimedia.org/T263190) (owner: BryanDavis)
[21:20:20] (PS1) BryanDavis: d/changelog: prepare 0.22 release [software/tools-manifest] - https://gerrit.wikimedia.org/r/628201
[21:32:26] (PS2) BryanDavis: webservice: restore setting backend via service.manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/628200 (https://phabricator.wikimedia.org/T263190)
[21:32:28] (PS1) BryanDavis: black: reformat with black==20.8b1 [software/tools-webservice] - https://gerrit.wikimedia.org/r/628203
[21:34:06] (CR) BryanDavis: [C: +2] d/changelog: prepare 0.22 release [software/tools-manifest] - https://gerrit.wikimedia.org/r/628201 (owner: BryanDavis)
[21:34:38] (Merged) jenkins-bot: d/changelog: prepare 0.22 release [software/tools-manifest] - https://gerrit.wikimedia.org/r/628201 (owner: BryanDavis)
[21:37:26] (PS3) DannyS712: Grant OTRSwiki bureaucrats `bigdelete` [mediawiki-config] - https://gerrit.wikimedia.org/r/627964 (https://phabricator.wikimedia.org/T263149)
[21:57:28] (PS1) BryanDavis: gitignore: ignore wmcs-package-build.py [software/tools-manifest] - https://gerrit.wikimedia.org/r/628226
[21:59:37] (CR) BryanDavis: [C: +2] gitignore: ignore wmcs-package-build.py [software/tools-manifest] - https://gerrit.wikimedia.org/r/628226 (owner: BryanDavis)
[22:00:02] (Merged) jenkins-bot: gitignore: ignore wmcs-package-build.py [software/tools-manifest] - https://gerrit.wikimedia.org/r/628226 (owner: BryanDavis)
[22:00:20] (CR) BryanDavis: [C: +2] black: reformat with black==20.8b1 [software/tools-webservice] - https://gerrit.wikimedia.org/r/628203 (owner: BryanDavis)
[22:01:07] (Merged) jenkins-bot: black: reformat with black==20.8b1 [software/tools-webservice] - https://gerrit.wikimedia.org/r/628203 (owner: BryanDavis)
[22:02:05] Operations, ops-codfw, serviceops: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (wiki_willy) a:Papaul
[22:06:15] Operations, ops-eqiad, cloud-services-team (Kanban): cloudvirt1033 ipmi alert - https://phabricator.wikimedia.org/T263145 (wiki_willy) a:Cmjohnson Hi @Cmjohnson - based on Netbox, looks like cloudvirt1033 is in rack C8, which wasn't part of the PDU upgrades these past weeks. Hopefully it's just...
[22:13:07] Operations, ops-codfw, DC-Ops, serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (wiki_willy) a:Papaul Hi @Dzahn - it looks like this host is out of warranty and due to be refreshed in Q3. If this ends up being a CPU issue and your team is able to get...
[22:13:38] Operations, ops-codfw, serviceops: mw2256 went down with thermal issues / fail-safe voltage is out of range - https://phabricator.wikimedia.org/T263022 (Dzahn) duplicate of T263065
[22:14:27] Operations, ops-codfw, DC-Ops, serviceops: mw2256 - CPU/board hardware issue - https://phabricator.wikimedia.org/T263065 (Dzahn) duplicate of T263022
[22:15:02] (PS1) Dzahn: phabricator: add script to check a phab user's email address [puppet] - https://gerrit.wikimedia.org/r/628229
[22:16:08] (CR) jerkins-bot: [V: -1] phabricator: add script to check a phab user's email address [puppet] - https://gerrit.wikimedia.org/r/628229 (owner: Dzahn)
[22:17:22] Operations, SRE-Access-Requests, Patch-For-Review: Allow Nicholas Skaggs to issue icinga commands - https://phabricator.wikimedia.org/T263191 (RLazarus) p:Triage→Medium
[22:17:39] Operations, SRE-Access-Requests, Patch-For-Review: Allow Nicholas Skaggs to issue icinga commands - https://phabricator.wikimedia.org/T263191 (RLazarus) I see you already put this on the agenda for the next SRE meeting on Monday, thanks. :) I expect it'll be uncontroversial, but if you don't need it...
[22:19:36] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (wiki_willy) Open→Resolved a:Jclark-ctr Just a quick reminder to add "ops-eqiad" in the project tag, so that the dc-ops task doesn't g...
[22:20:53] Operations, netops: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (CDanis)
[22:23:26] (PS2) Dzahn: phabricator: add script to check a phab user's email address [puppet] - https://gerrit.wikimedia.org/r/628229
[22:25:35] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25179/" [puppet] - https://gerrit.wikimedia.org/r/628229 (owner: Dzahn)
[22:40:39] (CR) Dzahn: [C: +2] nagios_common: delete unused contacts-new template [puppet] - https://gerrit.wikimedia.org/r/627969 (owner: Dzahn)
[22:42:17] (CR) Dzahn: [C: +2] ntp::daemon: replace hiera() with lookup(), lint [puppet] - https://gerrit.wikimedia.org/r/624332 (owner: Dzahn)
[22:46:42] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/25180/" [puppet] - https://gerrit.wikimedia.org/r/624332 (owner: Dzahn)
[22:49:38] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:51:18] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25172/" [puppet] - https://gerrit.wikimedia.org/r/626035 (owner: Paladox)
[22:52:57] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) Open→Resolved a:CDanis After extensive investigation by one of our network connectivity providers, we believe that the cause has been discovered...
[22:54:37] Operations, netops: cr1-codfw<->cr1-eqiad link saturation - https://phabricator.wikimedia.org/T263206 (CDanis) (an update: duh, we have ~3Gbit/s of codfw-->esams traffic that is traversing eqiad)
[22:59:08] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 23598 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200917T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:00:07] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/25181/ https://puppet-compiler.wmflabs.org/compiler1002/25182/" [puppet] - https://gerrit.wikimedia.org/r/627971 (owner: Dzahn)
[23:00:28] Operations, Analytics-Radar, Domains, Traffic, Wikimedia-General-or-Unknown: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (Krinkle) Those urls don't need to change. We just need to stop accidentally setting cookies on them. I'm 99% sure this is...
[23:00:35] Operations, netops: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (faidon) SGTM!
[23:00:58] Operations, Analytics-Radar, Domains, Traffic, and 2 others: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (Krinkle)
[23:02:36] (CR) Dzahn: nginx: add data types (2 comments) [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn)
[23:04:41] (PS2) Dzahn: nginx: add data types [puppet] - https://gerrit.wikimedia.org/r/624357
[23:15:18] Operations, netops: Consider balancing VRRP primaries to cr1/cr2 - https://phabricator.wikimedia.org/T263212 (faidon) p:Triage→Medium
[23:17:30] Operations, MediaWiki-extensions-CodeReview, Platform Engineering: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (greg) >>! In T205361#6170183, @Jdforrester-WMF wrote: > So, when is the Apache rewrite being put in place? That'...
[23:17:55] (PS2) Dzahn: base/monitoring: move monitor_screens to proper profile parameter [puppet] - https://gerrit.wikimedia.org/r/628152
[23:19:00] (CR) jerkins-bot: [V: -1] base/monitoring: move monitor_screens to proper profile parameter [puppet] - https://gerrit.wikimedia.org/r/628152 (owner: Dzahn)
[23:46:22] (CR) Dzahn: [C: -1] prometheus: replace remaining hiera() with lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[23:47:04] (PS13) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666