[13:07:37] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (jbond) @wkandek or @thcipriani can you approve the access @KFrancis are you able to confirm NDA status [13:08:02] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (jbond) p:Triage→Medium [13:10:14] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (jbond) @wkandek or @thcipriani an you approve the access @KFrancis are you able to confirm NDA status @OlyKalinichenkoSpeedAndFunction The SSH ke... [13:10:32] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (jbond) p:Triage→Medium [13:11:51] ops-eqiad, serviceops: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (akosiaris) [13:12:02] ops-eqiad, serviceops: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (akosiaris) p:Triage→Medium [13:13:13] ops-codfw, serviceops: decommission scb200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T275760 (akosiaris) [13:13:25] ops-codfw, serviceops: decommission scb200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T275760 (akosiaris) p:Triage→Medium [13:14:52] SRE, Product-Analytics, SRE-Access-Requests, Structured-Data-Backlog: Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (jbond) p:Triage→Medium [13:15:17] !log reinitialize all of staging-codfw. kubestage2* and kubestagemaster* have been scheduled downtime in icinga.
[13:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:30] SRE, SRE-Access-Requests, wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (jbond) p:Triage→Medium [13:18:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:22:15] (PS1) Hnowlan: prometheus::postgres_exporter: disk metrics and custom queries [puppet] - https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) [13:22:39] SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (jbond) p:Triage→Medium looking at triaging this task do we have a gut feel of where the issue may be e.g. mw server, general networking, swift, something else? [13:23:11] SRE, ops-eqiad, Discovery: elastic1033's mgmt is unreachable - https://phabricator.wikimedia.org/T275733 (jbond) p:Triage→Medium [13:23:23] SRE, ops-eqiad, Analytics: an-worker1111 PS Redundancy alert - https://phabricator.wikimedia.org/T275732 (jbond) p:Triage→Medium [13:23:25] (CR) Alexandros Kosiaris: [V: +1 C: +2] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/666615 (owner: Alexandros Kosiaris) [13:24:26] SRE, serviceops: Support proxying to etcd v3 storage on buster or later - https://phabricator.wikimedia.org/T275600 (jbond) p:Triage→Medium [13:24:46] (PS2) Gergő Tisza: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/666860 [13:25:11] PROBLEM - Prometheus k8s-staging cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [13:25:55] PROBLEM - Prometheus k8s-staging cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [13:26:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:02] (CR) Kosta Harlan: [C: -1] "Commit message needs to be updated, the other parts that I understand (service on part 4006, and the URL structure) seem OK to me. Thanks!" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [13:32:23] SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (fgiunchedi) >>! In T275752#6860917, @jbond wrote: > looking at triaging this task do we have a gut feel of where the issue may be e.g. mw server, general networking, swift, something else? Good question,...
[13:33:25] SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (jbond) >>! In T275752#6860955, @fgiunchedi wrote: >>>! In T275752#6860917, @jbond wrote: >> looking at triaging this task do we have a gut feel of where the issue may be e.g. mw server, general networking... [13:51:10] SRE, Wikimedia-Mailing-lists: Request for creation: Art+Feminism Wikimedians Mailing List - https://phabricator.wikimedia.org/T275552 (jbond) Open→Resolved a:jbond @Masssly The mailing list has now been created and you should be able to visit both the [[ https://lists.wikimedia.org/mailman/ad... [13:51:48] (CR) David Caro: doc: Introduce a code reviewing guideline (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [13:53:05] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE [13:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE [13:55:06] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE [13:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE [13:57:22] (PS1) Kormat: install_server: Fix kubernetes-node to work with buster.
[puppet] - https://gerrit.wikimedia.org/r/666896 [13:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:43] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (thcipriani) >>! In T275722#6860775, @jbond wrote: > @wkandek or @thcipriani can you approve the access Approve. Thanks! [13:58:07] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (thcipriani) >>! In T275679#6860798, @jbond wrote: > @wkandek or @thcipriani can you approve the access Approve. Thanks! [13:58:34] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (thcipriani) >>! In T275677#6860814, @jbond wrote: > @wkandek or @thcipriani an you approve the access Approve. Thanks! [14:01:19] (CR) Kormat: "Fixes the issue we were discussing yesterday." [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat) [14:06:49] SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (MoritzMuehlenhoff) [14:07:07] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:14] SRE, SRE-Access-Requests, wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (jbond) @mark or @faidon as Managers in #wikimedia-operations are you able to action or advice on this?
[14:09:33] !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1001.eqiad.wmnet [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:05] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:50] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet [14:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:05] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:14:15] !log klausman@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-serve-ctrl1002.eqiad.wmnet [14:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:50] (CR) Alexandros Kosiaris: install_server: Fix kubernetes-node to work with buster. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat) [14:16:23] (CR) Jforrester: "Huh."
[puppet] - https://gerrit.wikimedia.org/r/666787 (owner: Legoktm) [14:16:57] !log installing cairo security updates on buster [14:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:16] SRE, vm-requests: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (elukey) [14:17:26] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet [14:17:27] (PS1) Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - https://gerrit.wikimedia.org/r/666899 [14:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:19] (CR) Jbond: modules/sudo: Add TMUX variable to kept env vars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666899 (owner: Klausman) [14:20:30] (PS2) Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - https://gerrit.wikimedia.org/r/666899 [14:20:41] (CR) Klausman: modules/sudo: Add TMUX variable to kept env vars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666899 (owner: Klausman) [14:20:45] (CR) Ottomata: "Oop nice." [puppet] - https://gerrit.wikimedia.org/r/666788 (owner: Elukey) [14:20:59] !log klausman@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-serve-ctrl1002.eqiad.wmnet [14:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:08] (CR) Kormat: install_server: Fix kubernetes-node to work with buster.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat) [14:22:37] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet [14:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl1001.eqiad.wmnet [14:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:59] !log installing postgresql security updates on buster [14:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:38] (PS1) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666902 [14:35:13] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl1002.eqiad.wmnet [14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:07] SRE, vm-requests: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (klausman) Created ml-serve-ctrl1001 and ml-serve-ctrl1002 in eqiad, rows B and D. [14:38:39] (CR) Alexandros Kosiaris: install_server: Fix kubernetes-node to work with buster.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat) [14:38:58] (PS1) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 [14:40:43] (Abandoned) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666902 (owner: Jbond) [14:41:53] (PS1) Klausman: Temp fix for sudo/wmflib not handling tmux correctly [puppet] - https://gerrit.wikimedia.org/r/666905 [14:42:08] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:42:16] (CR) Klausman: [C: +2] Temp fix for sudo/wmflib not handling tmux correctly [puppet] - https://gerrit.wikimedia.org/r/666905 (owner: Klausman) [14:42:31] !log depool cp4032 for ats-tls/NUMA tests [14:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:46] (CR) Volans: "Could you please add this use case also to the test in the parametrization for test_ensure_shell_is_durable_sty? looks good otherwise." [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond) [14:43:38] (PS3) Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - https://gerrit.wikimedia.org/r/666899 [14:46:15] (CR) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon (9 comments) [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: Arturo Borrero Gonzalez) [14:49:16] SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (fgiunchedi) Testing up/down loads with https://commons.wikimedia.org/wiki/File:The_Lost_World_(1925).webm (300MB) file from `mw1305` (using a different swift account, not mediawiki's to protect against ac...
[14:49:27] (PS7) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) [14:50:51] (PS8) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) [14:52:22] (PS9) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) [14:53:53] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/28245/" [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: Arturo Borrero Gonzalez) [14:56:08] SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) 10G for this host (or all ms-be for that matter) is needed, please move the card over, that will indeed keep the mac address! thank you [14:59:56] !log pool cp4032 [15:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:33] !log installing libmaxminddb updates from buster 10.8 point release [15:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:46] (CR) Alexandros Kosiaris: [C: +2] install_server: Fix kubernetes-node to work with buster. [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat) [15:02:37] (CR) Alexandros Kosiaris: install_server: Fix kubernetes-node to work with buster.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat) [15:05:24] (PS2) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 [15:05:24] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl2001.codfw.wmnet [15:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:57] (CR) Jbond: "> Patch Set 1:" [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond) [15:09:51] SRE, Continuous-Integration-Infrastructure, serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (jbond) p:Triage→Medium [15:10:34] SRE, Analytics: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (jbond) p:Triage→Medium [15:12:42] SRE, Traffic, Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (BBlack) [15:15:19] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (akosiaris) https://github.com/helm/helm/issues/8271 says that --recreatepods won't work in helm3, we need to find an alternative.
[15:18:45] (CR) Vgutierrez: [V: +1 C: -1] "this isn't enough to bound traffic_server to a physical CPU" [puppet] - https://gerrit.wikimedia.org/r/666871 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez) [15:23:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl2001.codfw.wmnet [15:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:48] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl2002.codfw.wmnet [15:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:09] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE [15:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:12] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE [15:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:32] SRE, Continuous-Integration-Infrastructure, serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (thcipriani) >>! In T275731#6859724, @Joe wrote: > I agree that it would make sense for anyone with glob... [15:26:46] (PS1) Majavah: Make Toolforge docker registry cert configurable [puppet] - https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) [15:27:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:30:28] (PS1) David Caro: toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) [15:30:48] (PS1) Muehlenhoff: envoyproxy: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/666920 [15:35:24] (PS1) Muehlenhoff: mediawiki::packages::fonts: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/666928 [15:37:32] SRE, SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (jbond) Open→Resolved p:Triage→Medium a:jbond Sorry for the delay, i have now added abaso@wikimedia.org to m.wikidata.org and www.wik... [15:37:54] (PS1) Muehlenhoff: service::node: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/666930 [15:38:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl2002.codfw.wmnet [15:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:10] (CR) jerkins-bot: [V: -1] toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [15:38:49] SRE, vm-requests: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (elukey) Created ml-serve-ctrl200[1,2] in codfw, rows C and D (the ones with less VMs) MAC address for ml-serve-ctrl2001.codfw.wmnet is: aa:00:00:b7:68:43 MAC address for ml-serve-ct... [15:39:03] (CR) Volans: "CI is not happy, that aside, just two typos and a question inline."
(3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [15:39:53] (PS1) Elukey: Add ml-serve-ctrl200[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666931 (https://phabricator.wikimedia.org/T275630) [15:40:40] klausman:--^ (if you have a moment) [15:42:35] klausman: as your lawyer, i must advise you to be momentless [15:43:08] live in the non-moment [15:43:15] * klausman freezes solid as all momentum ceases [15:43:40] * elukey questions Tobias' life choices if Kormat is a lawyer [15:43:43] (CR) Klausman: [C: +1] Add ml-serve-ctrl200[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666931 (https://phabricator.wikimedia.org/T275630) (owner: Elukey) [15:44:01] *a* lawyer maybe, not *my* lawyer. [15:45:18] yes definitely, I know that you are a wise person [15:45:46] objection, not found in evidence [15:47:34] objection, the defense is sabotaging itself. [15:47:34] Wait, we don't even have a judge. [15:47:34] (PS1) Volans: code style: improve doc and link doc from tox [software/spicerack] - https://gerrit.wikimedia.org/r/666934 [15:47:35] (CR) Elukey: [C: +2] Add ml-serve-ctrl200[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666931 (https://phabricator.wikimedia.org/T275630) (owner: Elukey) [15:49:30] (PS1) Muehlenhoff: aptrepo: Remove update definitions only used on jessie [puppet] - https://gerrit.wikimedia.org/r/666935 [15:50:23] (CR) Volans: [C: +1] "LGTM, thanks!"
[software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond) [15:50:54] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 50.26 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [15:51:23] (PS1) Muehlenhoff: uwsgi: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/666938 [15:52:53] (CR) David Caro: "> Patch Set 1:" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [15:52:55] (PS2) David Caro: toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) [15:53:38] (PS1) Jbond: admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) [15:53:40] (PS1) Jbond: admin: add contint-roots to contint-admins using a yaml refrence [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) [15:54:07] (CR) jerkins-bot: [V: -1] admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [15:54:21] (CR) jerkins-bot: [V: -1] admin: add contint-roots to contint-admins using a yaml refrence [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [15:54:34] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:54:52] PROBLEM - restbase endpoints health on restbase1025
is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:04] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:04] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:08] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:10] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:12] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:20] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:26] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:28] (CR) David Caro: [C: +1] code style: improve doc and link doc from tox (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/666934 (owner: Volans) [15:56:13] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:13] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:26] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:32] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:38] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:54] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:02] (PS2) Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) [15:57:09] (CR) MSantos: [C: +1] prometheus::postgres_exporter: disk metrics and custom queries [puppet] - https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: Hnowlan) [15:57:14] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:23] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:24] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:28] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:28] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:31] (PS2) Jbond: admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) [15:57:32] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:34] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:57:48] (CR) jerkins-bot: [V: -1] admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [15:59:00] (CR)
jerkins-bot: [V: -1] admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [16:02:04] SRE, Maps, Traffic: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (MSantos) [16:02:52] (PS3) Jbond: admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) [16:03:31] (CR) Klausman: [C: +1] interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond) [16:03:48] (PS3) Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) [16:03:50] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28250/console" [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [16:04:18] (CR) jerkins-bot: [V: -1] admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [16:07:24] (PS4) Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) [16:08:28] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28252/console" [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond) [16:09:29] (CR) Alexandros Kosiaris: "15:49:36 | kubestage2002.codfw.wmnet | [info] kubestage2002 done."
[puppet] - 10https://gerrit.wikimedia.org/r/666896 (owner: 10Kormat) [16:09:38] (03PS3) 10Gergő Tisza: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198) [16:09:42] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 65.38 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:09:49] (03PS4) 10Gergő Tisza: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198) [16:10:33] (03PS1) 10Kormat: Revert "install_server: Use custom (non-puppet) recipe for d-i-test" [puppet] - 10https://gerrit.wikimedia.org/r/666701 [16:11:28] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:11:36] (03PS2) 10Kormat: Revert "install_server: Use custom (non-puppet) recipe for d-i-test" [puppet] - 10https://gerrit.wikimedia.org/r/666701 [16:11:36] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (10jbond) > would adding *contint_roots_members explicitly to contint-admin with a c... 
[16:13:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] Enable helmfile recreatePods for changeprop installations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/666225 (owner: 10Ppchelko) [16:13:40] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Use custom (non-puppet) recipe for d-i-test" [puppet] - 10https://gerrit.wikimedia.org/r/666701 (owner: 10Kormat) [16:15:26] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:15:45] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack) Updates on where we're at on some of the pain points above, in terms of solution analysis: 1. `large_objects_cutoff` and all related things - the key thing that ma... [16:16:00] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:04] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:10] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [16:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:39] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:16:45] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[16:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:52] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:06] (03CR) 10Huei Tan: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666680 (https://phabricator.wikimedia.org/T273674) (owner: 10Sbisson) [16:17:16] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:17:20] I'll deploy a beta-only config change [16:17:33] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [16:17:55] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [16:17:56] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:31] (03Merged) 10jenkins-bot: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [16:19:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Make Toolforge docker registry cert configurable [puppet] - 10https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) (owner: 10Majavah) [16:23:51] (03PS1) 10Awight: parquet logging falls back to default file handler [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) [16:23:59] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. 
[16:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond) [16:26:04] RECOVERY - Prometheus k8s-staging cache not updating on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [16:26:52] RECOVERY - Prometheus k8s-staging cache not updating on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [16:27:04] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 76.27 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:28:09] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[16:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:28] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:28:30] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:29:02] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:29:18] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:30:32] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:30:36] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:31:04] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:31:20] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:34:23] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [16:34:23] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . [16:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:31] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall (untested, please attach a PCC run)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [16:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666935 (owner: 10Muehlenhoff) [16:35:20] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 21.36 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:36:17] (03CR) 10Razzi: [C: 03+2] Add a job for TemplateWizard metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [16:36:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:36:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[16:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:56] (03PS1) 10Jbond: check_puppetrun: go critical puppet is disabled for more then a week [puppet] - 10https://gerrit.wikimedia.org/r/666950 [16:37:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [16:37:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [16:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:09] (03CR) 10David Caro: openstack: neutron: add wmcs-netns-events.py daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: 10Arturo Borrero Gonzalez) [16:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:30] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [16:37:30] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [16:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:00] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:38:00] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[16:38:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [16:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:28] (03PS1) 10Klausman: Add ml-serve-ctrl100[1,2] base config [puppet] - 10https://gerrit.wikimedia.org/r/666951 (https://phabricator.wikimedia.org/T275630) [16:38:45] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' . [16:38:46] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [16:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [16:39:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[16:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:24] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:26] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:40] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:42] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [16:39:42] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[16:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:35] I have to go shortly, it looks like sth is spamming logstash (and causing indexing conflicts too, hence the alert) [16:41:28] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:30] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:31] godog: probably the RB stuff above [16:41:44] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:54] ah yeah, likely [16:42:03] (03CR) 10Jbond: [C: 03+1] "LGTM thanks <3" [software/spicerack] - 10https://gerrit.wikimedia.org/r/666934 (owner: 10Volans) [16:42:46] (03CR) 10Jbond: [C: 03+2] interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - 10https://gerrit.wikimedia.org/r/666904 (owner: 10Jbond) [16:45:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666938 (owner: 10Muehlenhoff) [16:45:10] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:41] (03Merged) 10jenkins-bot: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - 10https://gerrit.wikimedia.org/r/666904 (owner: 10Jbond) [16:48:30] (03PS1) 10Alexandros Kosiaris: eventgate-analytics: Open access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666956 [16:48:42] (03CR) 10jerkins-bot: [V: 04-1] eventgate-analytics: Open access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666956 (owner: 10Alexandros Kosiaris) [16:49:49] (03PS2) 10Alexandros Kosiaris: eventgate-analytics: Open access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666956 [16:49:54] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 46.58 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:50:04] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [16:50:04] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[16:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:14] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 146 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:51:55] (03PS3) 10Alexandros Kosiaris: eventgate-analytics: Open access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666956 [16:52:22] (03PS1) 10Jbond: O:idp: add netbox as an authorised servie [puppet] - 10https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) [16:53:30] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:32] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:40] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:40] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:48] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:54:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics: Open access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666956 (owner: 10Alexandros Kosiaris) [16:54:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [16:54:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [16:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28253/console" [puppet] - 10https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [16:54:50] (03Merged) 10jenkins-bot: eventgate-analytics: Open access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666956 (owner: 10Alexandros Kosiaris) [16:55:10] (03PS2) 10Jbond: O:idp: add netbox as an authorised servie [puppet] - 10https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) [16:55:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:55:44] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:55:56] (03CR) 10Jbond: [C: 03+2] O:idp: add netbox as an authorised servie [puppet] - 10https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [16:56:00] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:00] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:06] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:13] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:14] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:56:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 45 probes of 597 (alerts on 65) - 
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:56:26] (03Abandoned) 10Effie Mouzeli: hieradata: remove shard06 from redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/666852 (https://phabricator.wikimedia.org/T272319) (owner: 10Effie Mouzeli) [16:56:44] (03PS1) 10Alexandros Kosiaris: eventgates: Allow access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666958 [16:57:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgates: Allow access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666958 (owner: 10Alexandros Kosiaris) [16:57:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:58:02] 10SRE, 10Analytics, 10Analytics-Kanban, 10Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (10mforns) a:03mforns [16:58:44] (03Merged) 10jenkins-bot: eventgates: Allow access to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/666958 (owner: 10Alexandros Kosiaris) [16:59:50] (03PS5) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [17:00:04] jbond42 and cdanis: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T1700). [17:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:06] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:10] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [17:00:10] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [17:00:16] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:16] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:18] o/ [17:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:22] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:24] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:24] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:30] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:38] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:38] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:01:08] the change should be a no-op in production. 
[17:01:11] (03PS1) 10Effie Mouzeli: memcached::templates: fix typo in memcached.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/666960 [17:01:38] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [17:01:38] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [17:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:00] (03CR) 10Elukey: [C: 03+1] "LGTM modulo the fact that I haven't checked the mac addresses :)" [puppet] - 10https://gerrit.wikimedia.org/r/666951 (https://phabricator.wikimedia.org/T275630) (owner: 10Klausman) [17:02:32] (03CR) 10Klausman: [C: 03+2] Add ml-serve-ctrl100[1,2] base config [puppet] - 10https://gerrit.wikimedia.org/r/666951 (https://phabricator.wikimedia.org/T275630) (owner: 10Klausman) [17:02:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [17:02:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' . [17:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [17:03:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [17:03:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[17:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' . [17:04:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [17:04:12] (03CR) 10Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: 10Arturo Borrero Gonzalez) [17:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:04:30] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [17:04:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:04:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' . [17:06:13] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [17:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:34] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [17:06:50] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [17:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:19] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [17:07:20] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [17:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:11:30] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/compiler1002/28255/" [puppet] - 10https://gerrit.wikimedia.org/r/666960 (owner: 10Effie Mouzeli) [17:11:49] (03CR) 10Effie Mouzeli: [C: 03+2] memcached::templates: fix typo in memcached.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/666960 (owner: 10Effie Mouzeli) [17:12:06] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1007 is CRITICAL: 3.337e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007 [17:12:45] ^-- This is ok, just forgot to downtime the service [17:12:46] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1008 is CRITICAL: 3.707e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [17:13:36] (03PS1) 10David Caro: etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 [17:13:44] (03PS5) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [17:14:08] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: 4.073e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [17:14:22] (03CR) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [17:14:27] (03PS6) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [17:17:29] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [17:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [17:18:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [17:18:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' . [17:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [17:18:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [17:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:04] (03CR) 10jerkins-bot: [V: 04-1] etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [17:19:08] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' . [17:19:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[17:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . [17:25:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [17:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:45] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [17:25:45] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [17:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:03] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [17:26:03] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [17:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:26:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[17:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:34] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [17:26:34] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [17:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:51] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [17:26:51] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' . [17:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:05] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [17:27:05] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' . [17:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:21] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [17:27:21] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[17:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:33] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [17:27:33] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [17:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:26] (03PS1) 10Ottomata: Expand range of Modify Kafka max replica lag slope alert [puppet] - 10https://gerrit.wikimedia.org/r/666966 (https://phabricator.wikimedia.org/T273702) [17:33:00] (03PS1) 10Elukey: install_server: use a more specific pattern for ml-serve[12]00[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/666967 [17:34:00] (03CR) 10Elukey: [C: 03+2] install_server: use a more specific pattern for ml-serve[12]00[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/666967 (owner: 10Elukey) [17:37:53] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:37:53] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[17:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:26] 10SRE, 10vm-requests, 10Patch-For-Review: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (10jbond) p:05Triage→03Medium [17:43:32] (03PS1) 10Elukey: install_server: fix ml-serve-ctrl2001's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/666970 [17:44:30] (03CR) 10Elukey: [C: 03+2] install_server: fix ml-serve-ctrl2001's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/666970 (owner: 10Elukey) [17:45:08] bash: joy: command not found --^ [17:47:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [17:47:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [17:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:53:22] 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10BBlack) [17:55:43] (03CR) 10Legoktm: [C: 04-1] "contint-admins should be added to the list in validate_duplicated_ops_permissions() in cross-validate-accounts.py" [puppet] - 10https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond) [17:58:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[17:58:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [17:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T1800). [18:05:08] what's a "graphoid"? [18:06:49] (03CR) 10Klausman: [C: 03+1] install_server: fix ml-serve-ctrl2001's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/666970 (owner: 10Elukey) [18:07:13] (03CR) 10MSantos: [C: 03+1] osm: add missing production step to import script [puppet] - 10https://gerrit.wikimedia.org/r/666596 (owner: 10Hnowlan) [18:07:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:44] https://wikitech.wikimedia.org/wiki/Graphoid [18:08:03] something that should be gone now legoktm :-D [18:08:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [18:08:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [18:08:13] Rip graphoid xD [18:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:51] Looks like jouncebot needs an update. [18:09:05] I do love its snarky attitude [18:09:43] 10SRE, 10SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (10dr0ptp4kt) Thank you for the access. 
I see the "owner" access for those domains but am unable to add users. Would you please grant me access on the wikid... [18:11:41] (03PS1) 10Giuseppe Lavagetto: Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702 [18:11:49] (03PS2) 10Giuseppe Lavagetto: Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702 [18:13:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702 (owner: 10Giuseppe Lavagetto) [18:13:47] (03Merged) 10jenkins-bot: Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702 (owner: 10Giuseppe Lavagetto) [18:18:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [18:18:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' . [18:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:33] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [18:18:33] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [18:18:33] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [18:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[18:18:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' . [18:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:00] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:19:00] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:14] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [18:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [18:19:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' . [18:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [18:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:57] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [18:20:57] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . 
[18:20:57] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [18:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:18] (03CR) 10Legoktm: [C: 03+1] "Nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) (owner: 10Majavah) [18:23:20] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Remove update definitions only used on jessie [puppet] - 10https://gerrit.wikimedia.org/r/666935 (owner: 10Muehlenhoff) [18:23:32] !log dns[1235]002 - upgrade gdnsd to 3.6.0 (dns4002 and authdns2001 already running it for some time!) [18:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:50] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [18:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:47] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [18:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [18:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:20] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [18:30:20] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [18:30:20] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[18:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:40] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [18:30:40] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [18:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:54] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' . [18:30:55] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [18:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:16] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:42] PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [18:36:42] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [18:36:43] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:00] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [18:37:01] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload 
(exit_code=99) [18:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:06] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:10] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [18:38:10] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:40:22] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Papaul) [18:41:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:45:37] (03CR) 10Dzahn: [C: 03+2] quarry: Remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/666783 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [18:46:18] RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [18:50:05] !log T267927 Trying to kick off data reload on `wdqs2008` from `cumin2001` fails because of `spicerack.remote.RemoteError: No hosts provided`. 
Doing some spelunking through IRC history looks like this happens when a host is not present in puppetDB. I've confirmed `wdqs2008` is absent on puppetboard, so running puppet agent to get it re-registered (hopefully) [18:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:11] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [18:54:16] (03CR) 10Legoktm: [C: 03+1] "Checked using cumin that there are no jessie servers left using this class." [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff) [18:56:51] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [18:56:52] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:24] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [18:57:24] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:52] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [18:57:59] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/28256/ looks good, though it still has an scb host in it" [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff) [18:58:04] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28257/console" [puppet] - 10https://gerrit.wikimedia.org/r/666920 (owner: 10Muehlenhoff) [18:59:18] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "scb1002 is a Service cluster B; includes:" [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff) [18:59:18] !log T267927 Manual puppet run got `wdqs2008` present in puppetdb again. Now being blocked by lack of host key for `wdqs2008` present on `cumin2001`, so I'm running puppet on `cumin2001` to get the latest state of `/etc/ssh/ssh_known_hosts` [18:59:18] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [18:59:18] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:25] (03CR) 10Dzahn: [V: 03+1 C: 03+2] mediawiki::packages::fonts: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff) [18:59:25] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [18:59:27] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666920 (owner: 10Muehlenhoff) [18:59:28] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Papaul) a:05Papaul→03RKemper @Gehel @RKemper disk replaced. Please resolve task when re-image is done. 
Thanks [18:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:01:24] (03PS1) 10Legoktm: ci: Use dedicated "ci-build" account for docker-registry pushes (try #2) [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559) [19:02:04] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack) [19:04:28] 10SRE, 10Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10BBlack) [19:04:35] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10CDanis) >>! In T274888#6861566, @BBlack wrote: > 4. `sh` hashing - I think @CDanis already worked on some patches to transition us to maglev hashing a quarter or two ago, b... [19:05:52] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [19:07:55] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack) I've spun out T275809 to go into some depth on the #1 part about `large_objects_cutoff` [19:12:40] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 65.62 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:14:47] (03PS1) 10Kosta Harlan: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) [19:16:19] !log T267927 Downloading dumps: `sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O /srv/wdqs/latest-all.ttl.bz2 && sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 -O /srv/wdqs/latest-lexemes.ttl.bz2` on `ryankemper@wdqs2008` tmux session `download_latest_dumps` [19:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:26] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [19:17:24] (03PS2) 10Gergő Tisza: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan) [19:17:57] ^last-minute backport [19:18:17] tgr_: \o [19:18:18] (03CR) 10Gergő Tisza: [C: 03+2] Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan) [19:24:18] (03PS2) 
10Legoktm: ci: Use dedicated "ci-build" account for docker-registry pushes (try #2) [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559) [19:25:20] (03PS1) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) [19:25:46] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 42.15 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:27:06] (03CR) 10jerkins-bot: [V: 04-1] phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:27:57] (03CR) 10Legoktm: [V: 03+1 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28258/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559) (owner: 10Legoktm) [19:29:38] (03CR) 10Dzahn: "was this empty?" [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559) (owner: 10Legoktm) [19:30:00] (03PS2) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) [19:30:09] wth.. it looked empty to me, but it's my bad connection [19:30:36] o.O [19:30:44] like it didn't show up on Gerrit properly? 
[19:30:44] (03Merged) 10jenkins-bot: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan) [19:31:13] like it showed the normal Gerrit UI but had not loaded the actual Files / diff section [19:31:21] huh [19:31:35] like when you rebase something that has been done meanwhile and rebases into nothing [19:31:38] (03CR) 10jerkins-bot: [V: 04-1] phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:33:02] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:26] * legoktm nods [19:33:27] kostajh: it's on mwdebug1001 [19:33:36] tgr_: thanks, checking [19:36:20] 10SRE, 10Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10BBlack) So, to expand a little bit on the text quoted at the top with some initial insights about cutoff vs nuke-limit tradeoffs and some of my current thinking and/or assumptions: * Turn... [19:37:06] tgr_: it doesn't break anything on test.wikipedia.org. 
The fix could only be verified on bnwiki in group2 , which is on wmf.31 still [19:37:26] ok, deploying [19:37:29] tgr_: we could merge this patch and backport to wmf.31 as well if you'd like [19:40:37] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.32/extensions/GrowthExperiments/: Backport: [[gerrit:666704|Impact module: Add "not rendered" state (T270294, T275615)]] (duration: 01m 26s) [19:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:45] T270294: Scale: deploy without impact module - https://phabricator.wikimedia.org/T270294 [19:40:45] T275615: 'impact_module_state' is a required property - https://phabricator.wikimedia.org/T275615 [19:40:53] let's do .31 so we don't have to worry about train rollbacks [19:41:02] (03PS1) 10JMeybohm: linkrecommendation: Allow egress to analytics.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 [19:41:29] (03PS1) 10Gergő Tisza: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666986 (https://phabricator.wikimedia.org/T270294) [19:41:40] (03CR) 10Gergő Tisza: [C: 03+2] Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666986 (https://phabricator.wikimedia.org/T270294) (owner: 10Gergő Tisza) [19:41:48] (03CR) 10JMeybohm: "Not sure on this, should we maybe not go though the CDN here?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm) [19:41:58] (03PS3) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) [19:42:40] (03CR) 10Kosta Harlan: "Thanks for this. How was the script able to pull the datasets without this patch?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm) [19:42:56] PROBLEM - WDQS SPARQL on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.219 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:44:09] (03CR) 10JMeybohm: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm) [19:44:13] (03CR) 10jerkins-bot: [V: 04-1] phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [19:51:07] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): elastic1033's mgmt is unreachable - https://phabricator.wikimedia.org/T275733 (10Gehel) [19:56:54] ACKNOWLEDGEMENT - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper data-reload in progress https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:54] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.216 second response time Ryan Kemper data-reload in progress https://phabricator.wikimedia.org/T267927 https://wikitech.wikime [19:56:54] data_query_service/Runbook [19:57:07] (03Merged) 10jenkins-bot: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666986 (https://phabricator.wikimedia.org/T270294) (owner: 10Gergő Tisza) [19:58:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:05] longma and marxarelli: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T2000). [20:00:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:21] o/ [20:00:55] marxarelli: backport is running over, I'll be done in a couple minutes [20:01:28] ...except there are a bunch of undeployed commits in wmf.31 [20:01:50] they are all to .pipeline/blubber.yaml, I suppose that can be ignored? [20:03:07] kostajh: can you check on group2/mwdebug1001? [20:04:21] tgr_: yes, just a few minutes please [20:06:38] PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [20:08:57] tgr_: no problem. longma is doing the deploy. i'm on backup. 
those were my undeployed commits actually, but they were reverted yesterday [20:09:24] tgr_: hmm, not seeing the HomepageVisit event on bnwiki [20:09:34] It's mwdebug1001 for sure? [20:09:35] either way, they shouldn't have an effect [20:09:53] let me double-check [20:10:22] (03PS1) 10Legoktm: docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) [20:11:43] kostajh: seems to be there [20:12:55] (03PS1) 10Legoktm: docker: Add prod-build password for deneb [labs/private] - 10https://gerrit.wikimedia.org/r/666985 [20:13:10] (03CR) 10Legoktm: [V: 03+2 C: 03+2] docker: Add prod-build password for deneb [labs/private] - 10https://gerrit.wikimedia.org/r/666985 (owner: 10Legoktm) [20:13:52] (03PS2) 10Legoktm: docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) [20:13:54] tgr_: I'm still seeing the validation error [20:14:52] oh well. the homepage is working so the patch didn't make anything worse, right? [20:15:13] tgr_: right [20:15:30] let's deploy it and get out of the way of the train. Maybe there's a job involved somehow and it will work when properly deployed. [20:15:38] if not, we can figure it out next week [20:15:56] RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [20:16:09] tgr_: sounds fine [20:16:59] (03PS1) 10Jdlrobson: Enable og tags on non-wikidata wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667007 (https://phabricator.wikimedia.org/T157145) [20:17:05] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/GrowthExperiments/: Backport: [[gerrit:666704|Impact module: Add "not rendered" state (T270294, T275615)]] (duration: 01m 08s) [20:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:15] T270294: Scale: deploy without impact module - https://phabricator.wikimedia.org/T270294 [20:17:15] T275615: 'impact_module_state' is a required property - https://phabricator.wikimedia.org/T275615 [20:17:21] (03PS1) 10Dzahn: package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) [20:17:47] (03CR) 10jerkins-bot: [V: 04-1] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:18:33] marxarelli, longma: all yours [20:18:46] thanks tgr_ [20:19:25] (03PS3) 10Legoktm: docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) [20:19:38] 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10Jdlrobson) @colewhite is there somebody I could pair with to write this... 
[20:20:28] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28260/console" [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) (owner: 10Legoktm) [20:23:00] (03PS4) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) [20:24:01] (03PS2) 10Gergő Tisza: Add GrowthExperiments maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) [20:29:06] (03PS2) 10Dzahn: package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) [20:29:08] (03PS1) 10Jeena Huneidi: all wikis to 1.36.0-wmf.32 refs T274936 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667009 [20:29:10] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.36.0-wmf.32 refs T274936 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667009 (owner: 10Jeena Huneidi) [20:29:31] (03CR) 10jerkins-bot: [V: 04-1] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:30:09] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.32 refs T274936 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667009 (owner: 10Jeena Huneidi) [20:44:10] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:18] (03CR) 10Dzahn: [C: 03+2] typos: remove 'bullsey' [puppet] - 10https://gerrit.wikimedia.org/r/667012 (owner: 10Dzahn) [20:47:06] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:48:09] (03CR) 10Dzahn: [C: 03+2] "thank you:)" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:51:41] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/28262/" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:52:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "even though compiler said noop.. it is not actually noop on cloudweb2001-dev :/" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:53:14] (03PS1) 10Dzahn: Revert "ldap::config::labs: replace hiera_hash with lookup" [puppet] - 10https://gerrit.wikimedia.org/r/666989 [20:54:25] (03CR) 10Dzahn: [C: 03+2] Revert "ldap::config::labs: replace hiera_hash with lookup" [puppet] - 10https://gerrit.wikimedia.org/r/666989 (owner: 10Dzahn) [20:54:46] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "ldap::config::labs: replace hiera_hash with lookup" [puppet] - 10https://gerrit.wikimedia.org/r/666989 (owner: 10Dzahn) [20:57:59] (03PS1) 10JMeybohm: event*: Enable egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015 [21:00:55] (03CR) 10Dzahn: "I don't know what is going on, but on cloudweb2001-dev there is a puppet change on every run that removes a LVS service IP. 
(which was not" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:06:23] (03CR) 10Dzahn: [C: 03+2] docker::engine: replace hiera_hash with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:08:36] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:59] (03CR) 10Dzahn: "noop confirmed: deneb, kubernetes2001, kubestage1002, releases1001" [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:09:37] (03PS2) 10Dzahn: deployment::rsync: remove references to the 'trebuchet' name [puppet] - 10https://gerrit.wikimedia.org/r/666757 [21:09:58] (03CR) 10Mforns: "Code makes sense to me, but I completely ignore the implications of this change... @elukey?" [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [21:15:18] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:48] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28264/" [puppet] - 10https://gerrit.wikimedia.org/r/666757 (owner: 10Dzahn) [21:19:35] (03CR) 10Dzahn: "confirmed rsync command still working on deploy2001,ran manually" [puppet] - 10https://gerrit.wikimedia.org/r/666757 (owner: 10Dzahn) [21:20:12] !log deploy2001 - rsynced /srv/deployment from deploy1001 after gerrit:666757 [21:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [21:20:40] 10SRE, 10SRE-swift-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10wiki_willy) [21:21:31] 10SRE, 10User-jijiki: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (10wiki_willy) [21:24:40] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Papaul) 05Open→03Resolved [21:25:02] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Papaul) 05Open→03Resolved [21:25:30] 10SRE, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T268622 (10Papaul) 05Open→03Resolved [21:32:22] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) (owner: 10Legoktm) [21:34:02] RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [21:36:46] (03PS1) 10Legoktm: docker: Lockdown /root/.docker a bit more [puppet] - 10https://gerrit.wikimedia.org/r/667017 [21:38:13] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28266/console" [puppet] - 10https://gerrit.wikimedia.org/r/667017 (owner: 10Legoktm) [21:38:59] !log pushed new version of docker-registry.discovery.wmnet/wikimedia-buster image [21:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:11] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker: Lockdown /root/.docker a bit more [puppet] - 10https://gerrit.wikimedia.org/r/667017 (owner: 10Legoktm) [21:39:24] (03PS1) 10Dzahn: role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018 [21:39:53] (03CR) 10jerkins-bot: [V: 04-1] role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018 (owner: 10Dzahn) [21:41:41] (03PS2) 10Dzahn: role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018 [21:45:24] (03PS3) 10Dzahn: role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018 [21:50:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:50:53] (03PS1) 10Odder: Add localised logos for the Altay Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667021 (https://phabricator.wikimedia.org/T275819) [21:52:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:54:26] (03Abandoned) 10Odder: Add localised logos for the Altay Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667021 (https://phabricator.wikimedia.org/T275819) (owner: 10Odder) [21:54:31] (03CR) 10Bstorm: [C: 03+2] Make Toolforge docker registry cert configurable [puppet] - 10https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) (owner: 10Majavah) [21:55:48] (03CR) 10Legoktm: "Hi Odder! The process is described in logos/README, let me know if you need any help/assistance with it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667021 (https://phabricator.wikimedia.org/T275819) (owner: 10Odder) [22:01:26] PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [22:03:22] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) @klausman any update on this? IF the install is done can you please resolve the task? Thanks. [22:05:12] (03CR) 10Ottomata: "What does it do!?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015 (owner: 10JMeybohm) [22:08:44] RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [22:14:22] so, I have 4 deployment servers with /srv/mediawiki on them that we synced at some point. but the sizes are: 12G, 16G, 18G and 20G.
yay [22:14:53] with /srv/deployment there is no such issue, they are all 43G equally because they pull from 1001 with --delete [22:15:47] now looking what to do with the /srv/mediawiki part before switching to 1002 in a couple days [22:17:09] 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10colewhite) @Jdlrobson I'd be happy to help. Ping me on IRC or elsewher... [22:25:11] why is "mediawiki-staging" over 200GB in codfw but only 20 in eqiad? Anyone who would be confident what can be deleted if anything? [22:26:00] o.O [22:26:16] mutante: on which server? (the 200GB) [22:26:32] legoktm: deploy2001.codfw.wmnet:/srv/mediawiki-staging [22:28:36] (03CR) 10Bstorm: [C: 03+2] "Ok, new wiki creates are done, so I'll give this a whirl!" [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE)) [22:28:57] * legoktm looks [22:29:24] oh [22:29:32] someone just needs to prune the old branches [22:31:46] legoktm: thank you. so yea, I have 4 different copies of /srv/mediawiki too and they are all a bit different [22:31:58] mutante: there's a `scap clean` command but I've never used it before. probably best to ask someone from releng to do it [22:32:21] or scap prune [22:32:26] I don't remember rightly [22:32:54] I don't see prune in the out put of `scap --help` [22:33:28] ack. interesting. thank you. or.. I just copy 'all the things' over to some archive dir on new hosts.. and it can be sorted out there [22:35:46] if we're setting up new things I think we should just start with the content on deploy1001 [22:36:35] the fact that everything else is slightly different isn't great if we had to unexpectedly switch over but I suspect it's not functional differences [22:37:01] /srv/deployment is like that. they are all pulling from 1001 with --delete.. 
so that is identical [22:37:11] what is not is the other stuff in /srv [22:37:36] what if we expectedly had to switch over :p [22:38:17] should there be another rsync with --delete for /srv/mediawiki from 1001 [22:38:23] to ..just the new servers [22:38:43] that I can do of course [22:39:50] (03CR) 10Bstorm: "Ok, I can confirm that (on my test instance, this flipped the bit for szywiki from 0 to 1." [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE)) [22:40:23] I think so [22:40:27] legoktm, mutante: that's odd. i `scap clean`'d all but two branches last week. i wonder if it's not executing correctly on the second deploy server [22:41:13] hmm [22:41:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 202 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:41:37] no, I think you're right, it only has cache directories left [22:41:40] marxarelli: the one where it's so large is the codfw one, the old one [22:42:04] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:36] do we really need the 1.32 and 1.33 branches though? [22:42:37] so it's about which dc we are currently in [22:42:37] (03CR) 10Bstorm: "Since this doesn't run automatically for all databases on schedule, this will only update when we run the script as well. 
If it ever becom" [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE)) [22:43:28] there's 116 branches on deploy2001, each is about 2G so there's our 200G+ [22:43:38] yikes [22:43:58] no, we shouldn't need anything that isn't on deploy1001 [22:44:10] i'll take a look [22:44:19] thank you :) [22:44:26] :D [22:47:34] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:49:06] longma: do you mind if i `scap clean` the wmf.30 branch? i want to see if it targets deploy2001 correctly (see ^) [22:49:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:59] Urbanecm: oh man.. and about the other stuff i was talking about. patches, rsync.. all that.. I already fixed this stuff! but for releases*, not deploy*! but it's the exact same issue for another set of servers. that's why it felt all so familiar but still not just working! aha [22:50:25] hehe [22:50:30] that path exists on releases* too [22:51:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:54:29] marxarelli: sure, go ahead [22:54:37] k [22:55:10] I was going to clean it up tomorrow since I always feel short on time on Tuesdays [22:55:58] makes sense. 
i just wanted to figure out why deploy2001 is full of old versions [22:56:20] yeah that makes sense [22:59:53] (03PS1) 10Dzahn: deployment::rsync:: also sync patches directory [puppet] - 10https://gerrit.wikimedia.org/r/667031 [23:01:58] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.30 (duration: 04m 20s) [23:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:59] (03CR) 10Dzahn: "This will make it so that any non-active deployment_server (deploy2001, deploy1002, deploy2002) will pull from deploy1001 with --delete to" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [23:03:19] (03PS1) 10Razzi: hadoop: Add new worker nodes to hadoop_clusters [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767) [23:04:46] (03CR) 10Dzahn: "You dont need to review the puppet code, just the idea that we pull from one source with --delete so there can only be one version of /srv" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [23:07:05] legoktm, mutante: seems `scap clean` is working properly but it doesn't target the /srv/mediawiki-staging dir on other deploy hosts. looks like ops/puppet/modules/scap/files/scap-master-sync is responsible for that [23:08:07] marxarelli: aha! thank you! 
and it has --delete in it [23:08:32] and it also syncs /srv/patches [23:08:42] yep yep [23:08:50] but i still need extra code in my change above to make puppet just automatically do this [23:09:02] just using the same way it already does for /srv/deployments [23:09:14] (03PS1) 10Mstyles: add new updater job properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) [23:09:29] (03PS1) 10Jdlrobson: Do not log graph errors to WMF servers [extensions/Graph] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666999 (https://phabricator.wikimedia.org/T274557) [23:09:29] so we could run /usr/local/bin/scap-master-sync on 2001 [23:12:26] seems reasonable to me, but i'm not super familiar with why all of those flags are used (`--delete-delay` and such) [23:12:54] and why all of those old php- directories are still around if it's specifying `--delete` already [23:13:09] !log deploy1002 - /usr/local/bin/scap-master-sync deploy1001.eqiad.wmnet [23:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:48] ^ " from the given deployment master to the local staging directory" [23:14:11] ah ok. and they weren't running before? [23:14:30] no, because /srv/patches as empty [23:14:35] that is what made me wonder about it [23:14:51] sorry, i'm not following your changes very closely. i just noticed you discussing the issue with old php- dirs and thought i'd help [23:14:54] got it [23:15:47] don't be sorry, it is very helpful [23:16:00] i am trying to sync between old and new eqiad server right now [23:16:04] :) [23:16:20] the thing here is that I am trying to manage it all with puppet [23:16:27] but scap also has commands for it that i did not know [23:16:37] but apparently nothing runs those automatically [23:17:19] so the main part, /srv/deployment was already puppetized but not this other part, staging and patches [23:17:43] and a big part of what I wanted review for was.. 
should all of this sync with --delete [23:17:54] now that I know existing scap scripts do that too.. i can safely assume it [23:17:58] (03PS2) 10Razzi: hadoop: Add new worker nodes to hadoop_clusters [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767) [23:18:04] and make it work automatically [23:20:29] yea, master sync worked. in that ..it pulled the patches. what it did NOT mean is that mediawiki-staging is now the same size on both [23:21:53] looking at --delete-delay [23:23:05] hmm, no, that does not explain it, just means deletions at the end of the transfer [23:24:59] ah, --exclude="**/cache/l10n/*.cdb" is it [23:25:16] mutante: yeah, looks like it excludes l10n [23:25:21] the .cdb files don't get copied over.. so cant expect them to be identical [23:25:31] which uses the most space by far :/ [23:25:45] ~ 2G per mw version [23:25:49] guess I should forget about this idea that I need both sides to be identical before I switch [23:26:10] i don't see why we should exclude the l10n cache [23:26:12] well, it's about the right order to switch [23:27:30] marxarelli: that means if we had to fail-over we first have to rebuild all the l10n cache and it takes ... a long time,right [23:27:57] it happens as part of the normal deploy process during `scap sync-world` [23:28:38] i'll file a task and we can discuss in releng. maybe we can ditch the exclude [23:29:37] ok, thank you! [23:30:56] the bigger issue is how to switch the centrally defined server but also already be synced [23:32:52] !log deploy2001 - scap-master-sync from deploy1001 [23:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:36] marxarelli: last comment for now. will follow-up on ticket. but ..it's trying to delete but "cannot delete non-empty directory" of old branches. that's the bug probably [23:35:12] ok. i think they're not empty because cache/l10n/*.cdb files are still there, right? 
[23:35:30] looks like it, they are all cache/l10n and then cache [23:35:37] yes [23:35:40] k [23:39:29] !log deploy2001 - scap-master-sync from deploy1001 runs and attempts to --delete files to stay in sync but fails to do so because *.cdb files are in cache dirs and rsync does not want to delete non-empty directories, this leads to build up of the size of /srv/mediawiki-staging to 10 times the size of eqiad [23:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:29] !log deploy2001 2/2 - because rsync is --delete but also --exclude="**/cache/l10n/*.cdb" --exclude="*.swp" you can't expect /srv/mediawiki-staging to be the same size on 2 servers [23:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:13] !log deploy1002, deploy2002 - scap-master-sync deploy1001.eqiad.wmnet (T265963) [23:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:19] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [23:59:40] (03Abandoned) 10Dzahn: remove deploy1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/635114 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)
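Editor's note on the 23:39 and 23:41 log entries: the interaction described there (a sync that deletes extraneous files but also excludes *.cdb, so stale branch directories can never be emptied) can be reproduced in miniature. The sketch below is a simplified Python model of that behaviour, not rsync itself and not the actual scap-master-sync script. The function names, the php-new / php-old layout, and the use of fnmatch-style matching in place of rsync's real filter rules are all assumptions made for illustration.

```python
import fnmatch
import os
import tempfile


def is_excluded(rel_path, patterns):
    """True if rel_path matches any exclude pattern (fnmatch-style, not rsync's exact filter rules)."""
    return any(fnmatch.fnmatch(rel_path, p) for p in patterns)


def prune_extraneous(src, dst, patterns):
    """Delete files under dst that are missing from src, but never touch excluded files.

    Returns the directories that could not be removed because protected
    (excluded) files were still inside them.
    """
    leftover_dirs = []
    # Walk bottom-up so children are handled before their parent directory.
    for root, dirs, files in os.walk(dst, topdown=False):
        rel_root = os.path.relpath(root, dst)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            if is_excluded(rel, patterns):
                continue  # excluded files are protected from deletion
            if not os.path.exists(os.path.join(src, rel)):
                os.remove(os.path.join(root, name))
        if rel_root != "." and not os.path.exists(os.path.join(src, rel_root)):
            try:
                os.rmdir(root)  # only succeeds on an empty directory
            except OSError:
                leftover_dirs.append(rel_root)
    return leftover_dirs


def run_demo():
    """Build a tiny source/receiver pair (hypothetical layout) and prune the receiver."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
        # One current branch on both sides, one stale branch on the receiver only.
        for base, rel in [
            (src, "php-new/index.php"),
            (dst, "php-new/index.php"),
            (dst, "php-old/index.php"),
            (dst, "php-old/cache/l10n/en.cdb"),
        ]:
            path = os.path.join(base, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        leftover = prune_extraneous(src, dst, ["**/cache/l10n/*.cdb", "*.swp"])
        stale_file_gone = not os.path.exists(os.path.join(dst, "php-old", "index.php"))
        cdb_kept = os.path.exists(os.path.join(dst, "php-old", "cache", "l10n", "en.cdb"))
        return sorted(leftover), stale_file_gone, cdb_kept


if __name__ == "__main__":
    print(run_demo())
```

In this model the ordinary files of the stale branch are removed, but the excluded .cdb file survives, so php-old, php-old/cache and php-old/cache/l10n all remain on the receiver. That matches the "cannot delete non-empty directory" symptom reported at 23:34 and the roughly 2G per leftover branch noted earlier; whether the exclude should be dropped is the question left for the releng task mentioned at 23:28.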