[13:07:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (jbond) @wkandek or @thcipriani can you approve the access @KFrancis are you able to confirm NDA status
[13:08:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (jbond) p:Triage→Medium
[13:10:14] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (jbond) @wkandek or @thcipriani an you approve the access @KFrancis are you able to confirm NDA status  @OlyKalinichenkoSpeedAndFunction The SSH ke...
[13:10:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (jbond) p:Triage→Medium
[13:11:51] <wikibugs>	 ops-eqiad, serviceops: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (akosiaris)
[13:12:02] <wikibugs>	 ops-eqiad, serviceops: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (akosiaris) p:Triage→Medium
[13:13:13] <wikibugs>	 ops-codfw, serviceops: decommission scb200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T275760 (akosiaris)
[13:13:25] <wikibugs>	 ops-codfw, serviceops: decommission scb200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T275760 (akosiaris) p:Triage→Medium
[13:14:52] <wikibugs>	 SRE, Product-Analytics, SRE-Access-Requests, Structured-Data-Backlog: Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (jbond) p:Triage→Medium
[13:15:17] <akosiaris>	 !log reinitialize all of staging-codfw. kubestage2* and kubestagemaster* have been scheduled downtime in icinga.
[13:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:30] <wikibugs>	 SRE, SRE-Access-Requests, wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (jbond) p:Triage→Medium
[13:18:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:22:15] <wikibugs>	 (PS1) Hnowlan: prometheus::postgres_exporter: disk metrics and custom queries [puppet] - https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858)
[13:22:39] <wikibugs>	 SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (jbond) p:Triage→Medium looking at triaging this task do we have a gut feel of where the issue may be e.g. mw server, general networking, swift, something else?
[13:23:11] <wikibugs>	 SRE, ops-eqiad, Discovery: elastic1033's mgmt is unreachable - https://phabricator.wikimedia.org/T275733 (jbond) p:Triage→Medium
[13:23:23] <wikibugs>	 SRE, ops-eqiad, Analytics: an-worker1111 PS Redundancy alert - https://phabricator.wikimedia.org/T275732 (jbond) p:Triage→Medium
[13:23:25] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/666615 (owner: Alexandros Kosiaris)
[13:24:26] <wikibugs>	 SRE, serviceops: Support proxying to etcd v3 storage on buster or later - https://phabricator.wikimedia.org/T275600 (jbond) p:Triage→Medium
[13:24:46] <wikibugs>	 (PS2) Gergő Tisza: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/666860
[13:25:11] <icinga-wm>	 PROBLEM - Prometheus k8s-staging cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops
[13:25:55] <icinga-wm>	 PROBLEM - Prometheus k8s-staging cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops
[13:26:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:31:02] <wikibugs>	 (CR) Kosta Harlan: [C: -1] "Commit message needs to be updated, the other parts that I understand (service on part 4006, and the URL structure) seem OK to me. Thanks!" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
[13:32:23] <wikibugs>	 SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (fgiunchedi) >>! In T275752#6860917, @jbond wrote: > looking at triaging this task do we have a gut feel of where the issue may be e.g. mw server, general networking, swift, something else?  Good question,...
[13:33:25] <wikibugs>	 SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (jbond) >>! In T275752#6860955, @fgiunchedi wrote: >>>! In T275752#6860917, @jbond wrote: >> looking at triaging this task do we have a gut feel of where the issue may be e.g. mw server, general networking...
[13:51:10] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Request for creation: Art+Feminism Wikimedians Mailing List - https://phabricator.wikimedia.org/T275552 (jbond) Open→Resolved a:jbond @Masssly The mailing list has now been created and you should be able to visit both the [[ https://lists.wikimedia.org/mailman/ad...
[13:51:48] <wikibugs>	 (CR) David Caro: doc: Introduce a code reviewing guideline (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro)
[13:53:05] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
[13:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:04] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[13:55:06] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: REIMAGE
[13:55:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:18] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[13:57:22] <wikibugs>	 (PS1) Kormat: install_server: Fix kubernetes-node to work with buster. [puppet] - https://gerrit.wikimedia.org/r/666896
[13:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (thcipriani) >>! In T275722#6860775, @jbond wrote: > @wkandek or @thcipriani  can you approve the access  Approve. Thanks!
[13:58:07] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (thcipriani) >>! In T275679#6860798, @jbond wrote: > @wkandek or @thcipriani can you approve the access  Approve. Thanks!
[13:58:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (thcipriani) >>! In T275677#6860814, @jbond wrote: > @wkandek or @thcipriani an you approve the access  Approve. Thanks!
[14:01:19] <wikibugs>	 (CR) Kormat: "Fixes the issue we were discussing yesterday." [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[14:06:49] <wikibugs>	 SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (MoritzMuehlenhoff)
[14:07:07] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:14] <wikibugs>	 SRE, SRE-Access-Requests, wikimedia-irc-freenode: Grant wmopbot +o permissions in #wikimedia-operations IRC channel - https://phabricator.wikimedia.org/T275711 (jbond) @mark or @faidon as Managers in #wikimedia-operations are you able to action or advice on this?
[14:09:33] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1001.eqiad.wmnet
[14:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:05] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:50] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet
[14:10:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:05] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:14:15] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-serve-ctrl1002.eqiad.wmnet
[14:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:50] <wikibugs>	 (CR) Alexandros Kosiaris: install_server: Fix kubernetes-node to work with buster. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[14:16:23] <wikibugs>	 (CR) Jforrester: "Huh." [puppet] - https://gerrit.wikimedia.org/r/666787 (owner: Legoktm)
[14:16:57] <moritzm>	 !log installing cairo security updates on buster
[14:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:16] <wikibugs>	 SRE, vm-requests: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (elukey)
[14:17:26] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet
[14:17:27] <wikibugs>	 (PS1) Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - https://gerrit.wikimedia.org/r/666899
[14:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:19] <wikibugs>	 (CR) Jbond: modules/sudo: Add TMUX variable to kept env vars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666899 (owner: Klausman)
[14:20:30] <wikibugs>	 (PS2) Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - https://gerrit.wikimedia.org/r/666899
[14:20:41] <wikibugs>	 (CR) Klausman: modules/sudo: Add TMUX variable to kept env vars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666899 (owner: Klausman)
[14:20:45] <wikibugs>	 (CR) Ottomata: "Oop nice." [puppet] - https://gerrit.wikimedia.org/r/666788 (owner: Elukey)
[14:20:59] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-serve-ctrl1002.eqiad.wmnet
[14:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:08] <wikibugs>	 (CR) Kormat: install_server: Fix kubernetes-node to work with buster. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[14:22:37] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl1002.eqiad.wmnet
[14:22:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:19] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl1001.eqiad.wmnet
[14:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:59] <moritzm>	 !log installing postgresql security updates on buster
[14:28:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:38] <wikibugs>	 (PS1) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666902
[14:35:13] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl1002.eqiad.wmnet
[14:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:07] <wikibugs>	 SRE, vm-requests: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (klausman) Created ml-serve-ctrl1001 and ml-serve-ctrl1002 in eqiad, rows B and D.
[14:38:39] <wikibugs>	 (CR) Alexandros Kosiaris: install_server: Fix kubernetes-node to work with buster. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[14:38:58] <wikibugs>	 (PS1) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904
[14:40:43] <wikibugs>	 (Abandoned) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666902 (owner: Jbond)
[14:41:53] <wikibugs>	 (PS1) Klausman: Temp fix for sudo/wmflib not handling tmux correctly [puppet] - https://gerrit.wikimedia.org/r/666905
[14:42:08] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:42:16] <wikibugs>	 (CR) Klausman: [C: +2] Temp fix for sudo/wmflib not handling tmux correctly [puppet] - https://gerrit.wikimedia.org/r/666905 (owner: Klausman)
[14:42:31] <vgutierrez>	 !log depool cp4032 for ats-tls/NUMA tests
[14:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:46] <wikibugs>	 (CR) Volans: "Could you please add this use case also to the test in the parametrization for test_ensure_shell_is_durable_sty? looks good otherwise." [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond)
[14:43:38] <wikibugs>	 (PS3) Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - https://gerrit.wikimedia.org/r/666899
[14:46:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon (9 comments) [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: Arturo Borrero Gonzalez)
[14:49:16] <wikibugs>	 SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (fgiunchedi) Testing up/down loads with https://commons.wikimedia.org/wiki/File:The_Lost_World_(1925).webm (300MB) file from `mw1305` (using a different swift account, not mediawiki's to protect against ac...
[14:49:27] <wikibugs>	 (PS7) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483)
[14:50:51] <wikibugs>	 (PS8) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483)
[14:52:22] <wikibugs>	 (PS9) Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483)
[14:53:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/28245/" [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: Arturo Borrero Gonzalez)
[14:56:08] <wikibugs>	 SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) 10G for this host (or all ms-be for that matter) is needed, please move the card over, that will indeed keep the mac address! thank you
[14:59:56] <vgutierrez>	 !log pool cp4032
[15:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:33] <moritzm>	 !log installing libmaxminddb updates from buster 10.8 point release
[15:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] install_server: Fix kubernetes-node to work with buster. [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[15:02:37] <wikibugs>	 (CR) Alexandros Kosiaris: install_server: Fix kubernetes-node to work with buster. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[15:05:24] <wikibugs>	 (PS2) Jbond: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904
[15:05:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl2001.codfw.wmnet
[15:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:57] <wikibugs>	 (CR) Jbond: "> Patch Set 1:" [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond)
[15:09:51] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (jbond) p:Triage→Medium
[15:10:34] <wikibugs>	 SRE, Analytics: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (jbond) p:Triage→Medium
[15:12:42] <wikibugs>	 SRE, Traffic, Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (BBlack)
[15:15:19] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (akosiaris) https://github.com/helm/helm/issues/8271 says that --recreatepods won't work in helm3, we need to find an alternative.
[15:18:45] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: -1] "this isn't enough to bound traffic_server to a physical CPU" [puppet] - https://gerrit.wikimedia.org/r/666871 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez)
[15:23:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl2001.codfw.wmnet
[15:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-serve-ctrl2002.codfw.wmnet
[15:23:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:09] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[15:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:26:12] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: REIMAGE
[15:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:32] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (thcipriani) >>! In T275731#6859724, @Joe wrote: > I agree that it would make sense for anyone with glob...
[15:26:46] <wikibugs>	 (PS1) Majavah: Make Toolforge docker registry cert configurable [puppet] - https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701)
[15:27:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:30:28] <wikibugs>	 (PS1) David Caro: toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497)
[15:30:48] <wikibugs>	 (PS1) Muehlenhoff: envoyproxy: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/666920
[15:35:24] <wikibugs>	 (PS1) Muehlenhoff: mediawiki::packages::fonts: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/666928
[15:37:32] <wikibugs>	 SRE, SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (jbond) Open→Resolved p:Triage→Medium a:jbond Sorry for the delay, i have now added  abaso@wikimedia.org to m.wikidata.org and www.wik...
[15:37:54] <wikibugs>	 (PS1) Muehlenhoff: service::node: Remove jessie support [puppet] - https://gerrit.wikimedia.org/r/666930
[15:38:03] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-serve-ctrl2002.codfw.wmnet
[15:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: David Caro)
[15:38:49] <wikibugs>	 SRE, vm-requests: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (elukey) Created ml-serve-ctrl200[1,2] in codfw, rows C and D (the ones with less VMs)  MAC address for ml-serve-ctrl2001.codfw.wmnet is: aa:00:00:b7:68:43 MAC address for ml-serve-ct...
[15:39:03] <wikibugs>	 (CR) Volans: "CI is not happy, that aside, just two typos and a question inline." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: David Caro)
[15:39:53] <wikibugs>	 (PS1) Elukey: Add ml-serve-ctrl200[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666931 (https://phabricator.wikimedia.org/T275630)
[15:40:40] <elukey>	 klausman:--^ (if you have a moment)
[15:42:35] <kormat>	 klausman: as your lawyer, i must advise you to be momentless
[15:43:08] <kormat>	 live in the non-moment
[15:43:15] * klausman freezes solid as all momentum ceases
[15:43:40] * elukey questions Tobias' life choices if Kormat is a lawyer 
[15:43:43] <wikibugs>	 (CR) Klausman: [C: +1] Add ml-serve-ctrl200[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666931 (https://phabricator.wikimedia.org/T275630) (owner: Elukey)
[15:44:01] <klausman>	 *a* lawyer maybe, not *my* lawyer.
[15:45:18] <elukey>	 yes definitely, I know that you are a wise person
[15:45:46] <kormat>	 objection, not found in evidence
[15:47:34] <klausman>	 objection, the defense is sabotaging itself.
[15:47:34] <klausman>	 Wait, we don't even have a judge.
[15:47:34] <wikibugs>	 (PS1) Volans: code style: improve doc and link doc from tox [software/spicerack] - https://gerrit.wikimedia.org/r/666934
[15:47:35] <wikibugs>	 (CR) Elukey: [C: +2] Add ml-serve-ctrl200[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666931 (https://phabricator.wikimedia.org/T275630) (owner: Elukey)
[15:49:30] <wikibugs>	 (PS1) Muehlenhoff: aptrepo: Remove update definitions only used on jessie [puppet] - https://gerrit.wikimedia.org/r/666935
[15:50:23] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks!" [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond)
[15:50:54] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 50.26 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[15:51:23] <wikibugs>	 (PS1) Muehlenhoff: uwsgi: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/666938
[15:52:53] <wikibugs>	 (CR) David Caro: "> Patch Set 1:" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: David Caro)
[15:52:55] <wikibugs>	 (PS2) David Caro: toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497)
[15:53:38] <wikibugs>	 (PS1) Jbond: admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731)
[15:53:40] <wikibugs>	 (PS1) Jbond: admin: add contint-roots to contint-admins using a yaml refrence [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731)
[15:54:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[15:54:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: add contint-roots to contint-admins using a yaml refrence [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[15:54:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:54:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:55:28] <wikibugs>	 (CR) David Caro: [C: +1] code style: improve doc and link doc from tox (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/666934 (owner: Volans)
[15:56:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:02] <wikibugs>	 (PS2) Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731)
[15:57:09] <wikibugs>	 (CR) MSantos: [C: +1] prometheus::postgres_exporter: disk metrics and custom queries [puppet] - https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: Hnowlan)
[15:57:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:31] <wikibugs>	 (PS2) Jbond: admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731)
[15:57:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:57:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[15:59:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[16:02:04] <wikibugs>	 SRE, Maps, Traffic: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (MSantos)
[16:02:52] <wikibugs>	 (PS3) Jbond: admin: add ops to contint-admins [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731)
[16:03:31] <wikibugs>	 (CR) Klausman: [C: +1] interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond)
[16:03:48] <wikibugs>	 (PS3) Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731)
[16:03:50] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28250/console" [puppet] - https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[16:04:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[16:07:24] <wikibugs>	 (PS4) Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731)
[16:08:28] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28252/console" [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[16:09:29] <wikibugs>	 (CR) Alexandros Kosiaris: "15:49:36 | kubestage2002.codfw.wmnet | [info] kubestage2002 done." [puppet] - https://gerrit.wikimedia.org/r/666896 (owner: Kormat)
[16:09:38] <wikibugs>	 (PS3) Gergő Tisza: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198)
[16:09:42] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 65.38 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:09:49] <wikibugs>	 (PS4) Gergő Tisza: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198)
[16:10:33] <wikibugs>	 (PS1) Kormat: Revert "install_server: Use custom (non-puppet) recipe for d-i-test" [puppet] - https://gerrit.wikimedia.org/r/666701
[16:11:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:11:36] <wikibugs>	 (PS2) Kormat: Revert "install_server: Use custom (non-puppet) recipe for d-i-test" [puppet] - https://gerrit.wikimedia.org/r/666701
[16:11:36] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (jbond) > would adding *contint_roots_members explicitly to contint-admin with a c...
[16:13:24] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] Enable helmfile recreatePods for changeprop installations (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/666225 (owner: Ppchelko)
[16:13:40] <wikibugs>	 (CR) Kormat: [C: +2] Revert "install_server: Use custom (non-puppet) recipe for d-i-test" [puppet] - https://gerrit.wikimedia.org/r/666701 (owner: Kormat)
[16:15:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:15:45] <wikibugs>	 SRE, Traffic, Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (BBlack) Updates on where we're at on some of the pain points above, in terms of solution analysis:  1. `large_objects_cutoff` and all related things - the key thing that ma...
[16:16:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:16:04] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:10] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[16:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:39] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:16:45] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[16:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:06] <wikibugs>	 (CR) Huei Tan: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/666680 (https://phabricator.wikimedia.org/T273674) (owner: Sbisson)
[16:17:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:17:20] <tgr_>	 I'll deploy a beta-only config change
[16:17:33] <wikibugs>	 (CR) Gergő Tisza: [C: +2] GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198) (owner: Gergő Tisza)
[16:17:55] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:17:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:31] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: link recommendation service URL for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/666860 (https://phabricator.wikimedia.org/T274198) (owner: Gergő Tisza)
[16:19:04] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Make Toolforge docker registry cert configurable [puppet] - https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) (owner: Majavah)
[16:23:51] <wikibugs>	 (PS1) Awight: parquet logging falls back to default file handler [puppet] - https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757)
[16:23:59] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[16:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:35] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: Jbond)
[16:26:04] <icinga-wm>	 RECOVERY - Prometheus k8s-staging cache not updating on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops
[16:26:52] <icinga-wm>	 RECOVERY - Prometheus k8s-staging cache not updating on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops
[16:27:04] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 76.27 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:28:09] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[16:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:28:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:29:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:29:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:30:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:30:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:31:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:31:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:34:23] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
[16:34:23] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
[16:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:31] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, LGTM overall (untested, please attach a PCC run)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: Hnowlan)
[16:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/666935 (owner: Muehlenhoff)
[16:35:20] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 21.36 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:36:17] <wikibugs>	 (CR) Razzi: [C: +2] Add a job for TemplateWizard metrics aggregation [puppet] - https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: Awight)
[16:36:31] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:36:31] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:56] <wikibugs>	 (PS1) Jbond: check_puppetrun: go critical puppet is disabled for more then a week [puppet] - https://gerrit.wikimedia.org/r/666950
[16:37:02] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[16:37:02] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[16:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:09] <wikibugs>	 (CR) David Caro: openstack: neutron: add wmcs-netns-events.py daemon (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: Arturo Borrero Gonzalez)
[16:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:30] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[16:37:30] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[16:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:00] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[16:38:00] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[16:38:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:18] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' .
[16:38:18] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[16:38:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:28] <wikibugs>	 (PS1) Klausman: Add ml-serve-ctrl100[1,2] base config [puppet] - https://gerrit.wikimedia.org/r/666951 (https://phabricator.wikimedia.org/T275630)
[16:38:45] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[16:38:46] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[16:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:17] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
[16:39:17] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
[16:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:39:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:39:42] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[16:39:42] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[16:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:35] <godog>	 I have to go shortly, it looks like sth is spamming logstash (and causing indexing conflicts too, hence the alert)
[16:41:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:41:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:41:31] <bblack>	 godog: probably the RB stuff above
[16:41:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:41:54] <godog>	 ah yeah, likely
[16:42:03] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM thanks <3" [software/spicerack] - https://gerrit.wikimedia.org/r/666934 (owner: Volans)
[16:42:46] <wikibugs>	 (CR) Jbond: [C: +2] interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond)
[16:45:09] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/666938 (owner: Muehlenhoff)
[16:45:10] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:41] <wikibugs>	 (Merged) jenkins-bot: interactive: also check term for tmux in ensure_shell_is_durable [software/pywmflib] - https://gerrit.wikimedia.org/r/666904 (owner: Jbond)
[16:48:30] <wikibugs>	 (PS1) Alexandros Kosiaris: eventgate-analytics: Open access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666956
[16:48:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] eventgate-analytics: Open access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666956 (owner: Alexandros Kosiaris)
[16:49:49] <wikibugs>	 (PS2) Alexandros Kosiaris: eventgate-analytics: Open access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666956
[16:49:54] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 46.58 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:50:04] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[16:50:04] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[16:50:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:14] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 146 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:51:55] <wikibugs>	 (PS3) Alexandros Kosiaris: eventgate-analytics: Open access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666956
[16:52:22] <wikibugs>	 (PS1) Jbond: O:idp: add netbox as an authorised servie [puppet] - https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849)
[16:53:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:54:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] eventgate-analytics: Open access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666956 (owner: Alexandros Kosiaris)
[16:54:09] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[16:54:09] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[16:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:20] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28253/console" [puppet] - https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) (owner: Jbond)
[16:54:50] <wikibugs>	 (Merged) jenkins-bot: eventgate-analytics: Open access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666956 (owner: Alexandros Kosiaris)
[16:55:10] <wikibugs>	 (PS2) Jbond: O:idp: add netbox as an authorised servie [puppet] - https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849)
[16:55:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:55:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:55:56] <wikibugs>	 (CR) Jbond: [C: +2] O:idp: add netbox as an authorised servie [puppet] - https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) (owner: Jbond)
[16:56:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:56:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:56:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:56:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:56:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:56:26] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 45 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:56:26] <wikibugs>	 (Abandoned) Effie Mouzeli: hieradata: remove shard06 from redis_sessions [puppet] - https://gerrit.wikimedia.org/r/666852 (https://phabricator.wikimedia.org/T272319) (owner: Effie Mouzeli)
[16:56:44] <wikibugs>	 (PS1) Alexandros Kosiaris: eventgates: Allow access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666958
[16:57:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] eventgates: Allow access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666958 (owner: Alexandros Kosiaris)
[16:57:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:58:02] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (mforns) a:mforns
[16:58:44] <wikibugs>	 (Merged) jenkins-bot: eventgates: Allow access to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/666958 (owner: Alexandros Kosiaris)
[16:59:50] <wikibugs>	 (PS5) Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115)
[17:00:04] <jouncebot>	 jbond42 and cdanis: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T1700).
[17:00:04] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:10] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[17:00:10] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[17:00:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:18] <tgr_>	 o/
[17:00:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:01:08] <tgr_>	 the change should be a no-op in production.
[17:01:11] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached::templates: fix typo in memcached.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/666960
[17:01:38] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[17:01:38] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[17:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "LGTM modulo the fact that I haven't checked the mac addresses :)" [puppet] - 10https://gerrit.wikimedia.org/r/666951 (https://phabricator.wikimedia.org/T275630) (owner: 10Klausman)
[17:02:32] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Add ml-serve-ctrl100[1,2] base config [puppet] - 10https://gerrit.wikimedia.org/r/666951 (https://phabricator.wikimedia.org/T275630) (owner: 10Klausman)
[17:02:43] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[17:02:43] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
[17:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:02] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[17:03:02] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[17:03:02] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[17:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:09] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[17:04:09] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[17:04:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: openstack: neutron: add wmcs-netns-events.py daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666616 (https://phabricator.wikimedia.org/T275483) (owner: 10Arturo Borrero Gonzalez)
[17:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:04:30] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[17:04:37] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:04:37] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[17:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
[17:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:12] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
[17:06:13] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[17:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:34] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[17:06:50] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[17:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:19] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[17:07:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
[17:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:11:30] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC OK  https://puppet-compiler.wmflabs.org/compiler1002/28255/" [puppet] - 10https://gerrit.wikimedia.org/r/666960 (owner: 10Effie Mouzeli)
[17:11:49] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] memcached::templates: fix typo in memcached.systemd.erb [puppet] - 10https://gerrit.wikimedia.org/r/666960 (owner: 10Effie Mouzeli)
[17:12:06] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1007 is CRITICAL: 3.337e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007
[17:12:45] <razzi>	 ^-- This is ok, just forgot to downtime the service
[17:12:46] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1008 is CRITICAL: 3.707e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008
[17:13:36] <wikibugs>	 (03PS1) 10David Caro: etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961
[17:13:44] <wikibugs>	 (03PS5) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115)
[17:14:08] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: 4.073e+08 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[17:14:22] <wikibugs>	 (03CR) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli)
[17:14:27] <wikibugs>	 (03PS6) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115)
[17:17:29] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[17:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[17:18:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[17:18:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' .
[17:18:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:43] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[17:18:43] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[17:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro)
[17:19:08] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' .
[17:19:09] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[17:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:31] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' .
[17:25:31] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' .
[17:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:45] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[17:25:45] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[17:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:03] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[17:26:03] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[17:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:17] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[17:26:17] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[17:26:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:34] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[17:26:34] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[17:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:51] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[17:26:51] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' .
[17:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:05] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[17:27:05] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[17:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:21] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' .
[17:27:21] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' .
[17:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[17:27:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[17:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:26] <wikibugs>	 (03PS1) 10Ottomata: Expand range of Modify Kafka max replica lag slope alert [puppet] - 10https://gerrit.wikimedia.org/r/666966 (https://phabricator.wikimedia.org/T273702)
[17:33:00] <wikibugs>	 (03PS1) 10Elukey: install_server: use a more specific pattern for ml-serve[12]00[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/666967
[17:34:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: use a more specific pattern for ml-serve[12]00[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/666967 (owner: 10Elukey)
[17:37:53] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[17:37:53] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[17:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:26] <wikibugs>	 10SRE, 10vm-requests, 10Patch-For-Review: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (10jbond) p:05Triage→03Medium
[17:43:32] <wikibugs>	 (03PS1) 10Elukey: install_server: fix ml-serve-ctrl2001's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/666970
[17:44:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: fix ml-serve-ctrl2001's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/666970 (owner: 10Elukey)
[17:45:08] <elukey>	 bash: joy: command not found --^
[17:47:59] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[17:47:59] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[17:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:53:22] <wikibugs>	 10SRE, 10Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (10BBlack)
[17:55:43] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "contint-admins should be added to the list in validate_duplicated_ops_permissions() in cross-validate-accounts.py" [puppet] - 10https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond)
[17:58:06] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[17:58:06] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[17:58:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T1800).
[18:05:08] <legoktm>	 what's a "graphoid"?
[18:06:49] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] install_server: fix ml-serve-ctrl2001's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/666970 (owner: 10Elukey)
[18:07:13] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] osm: add missing production step to import script [puppet] - 10https://gerrit.wikimedia.org/r/666596 (owner: 10Hnowlan)
[18:07:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:07:44] <dancy>	 https://wikitech.wikimedia.org/wiki/Graphoid
[18:08:03] <apergos>	 something that should be gone now legoktm :-D
[18:08:12] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[18:08:12] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[18:08:13] <Zppix>	 Rip graphoid xD
[18:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:51] <dancy>	 Looks like jouncebot needs an update.
[18:09:05] <dancy>	 I do love its snarky attitude
[18:09:43] <wikibugs>	 10SRE, 10SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (10dr0ptp4kt) Thank you for the access. I see the "owner" access for those domains but am unable to add users. Would you please grant me access on the wikid...
[18:11:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702
[18:11:49] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702
[18:13:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702 (owner: 10Giuseppe Lavagetto)
[18:13:47] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Increase concurrency for cdnPurge job to 200" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666702 (owner: 10Giuseppe Lavagetto)
[18:18:17] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[18:18:18] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
[18:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[18:18:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[18:18:33] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[18:18:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:47] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[18:18:47] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[18:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:00] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[18:19:00] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[18:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:14] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' .
[18:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:28] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[18:19:28] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
[18:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:43] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[18:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:57] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[18:20:57] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
[18:20:57] <wikibugs>	 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff)
[18:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:18] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "Nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) (owner: 10Majavah)
[18:23:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Remove update definitions only used on jessie [puppet] - 10https://gerrit.wikimedia.org/r/666935 (owner: 10Muehlenhoff)
[18:23:32] <bblack>	 !log dns[1235]002 - upgrade gdnsd to 3.6.0 (dns4002 and authdns2001 already running it for some time!)
[18:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:50] <logmsgbot>	 !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[18:25:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:47] <logmsgbot>	 !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[18:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:06] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[18:30:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[18:30:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[18:30:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' .
[18:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:40] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[18:30:40] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[18:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:54] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' .
[18:30:55] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[18:30:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:16] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[18:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:42] <icinga-wm>	 PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[18:36:42] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[18:36:43] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:36:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:00] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[18:37:01] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:06] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:10] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[18:38:10] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:59] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:40:22] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Papaul)
[18:41:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:45:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] quarry: Remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/666783 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[18:46:18] <icinga-wm>	 RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[18:50:05] <ryankemper>	 !log T267927 Trying to kick off data reload on `wdqs2008` from `cumin2001` fails because of `spicerack.remote.RemoteError: No hosts provided`. Some spelunking through IRC history suggests this happens when a host is not present in puppetDB. I've confirmed `wdqs2008` is absent on puppetboard, so running puppet agent to get it re-registered (hopefully)
[18:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:11] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[18:54:16] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "Checked using cumin that there are no jessie servers left using this class." [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff)
[18:56:51] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[18:56:52] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:24] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[18:57:24] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:52] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[18:57:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/28256/ looks good, though it still has an scb host in it" [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff)
[18:58:04] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28257/console" [puppet] - 10https://gerrit.wikimedia.org/r/666920 (owner: 10Muehlenhoff)
[18:59:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "scb1002 is a Service cluster B; includes:" [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff)
[18:59:18] <ryankemper>	 !log T267927 Manual puppet run got `wdqs2008` present in puppetdb again. Now blocked by the lack of a host key for `wdqs2008` on `cumin2001`, so I'm running puppet on `cumin2001` to get the latest state of `/etc/ssh/ssh_known_hosts`
[18:59:18] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload
[18:59:18] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:25] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] mediawiki::packages::fonts: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/666928 (owner: 10Muehlenhoff)
[18:59:25] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[18:59:27] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666920 (owner: 10Muehlenhoff)
[18:59:28] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Papaul) a:05Papaul→03RKemper @Gehel @RKemper disk replaced. Please resolve task when re-image is done.   Thanks
[18:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T1900).
[19:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:01:24] <wikibugs>	 (03PS1) 10Legoktm: ci: Use dedicated "ci-build" account for docker-registry pushes (try #2) [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559)
[19:02:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack)
[19:04:28] <wikibugs>	 10SRE, 10Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10BBlack)
[19:04:35] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10CDanis) >>! In T274888#6861566, @BBlack wrote: > 4. `sh` hashing - I think @CDanis already worked on some patches to transition us to maglev hashing a quarter or two ago, b...
[19:05:52] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[19:07:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10BBlack) I've spun out T275809 to go into some depth on the #1 part about `large_objects_cutoff`
[19:12:40] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 65.62 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:14:47] <wikibugs>	 (03PS1) 10Kosta Harlan: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294)
[19:16:19] <ryankemper>	 !log T267927 Downloading dumps: `sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O /srv/wdqs/latest-all.ttl.bz2 && sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 -O /srv/wdqs/latest-lexemes.ttl.bz2` on `ryankemper@wdqs2008` tmux session `download_latest_dumps`
[19:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:26] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[19:17:24] <wikibugs>	 (03PS2) 10Gergő Tisza: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan)
[19:17:57] <tgr_>	 ^last-minute backport
[19:18:17] <kostajh>	 tgr_: \o
[19:18:18] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan)
[19:24:18] <wikibugs>	 (03PS2) 10Legoktm: ci: Use dedicated "ci-build" account for docker-registry pushes (try #2) [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559)
[19:25:20] <wikibugs>	 (03PS1) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673)
[19:25:46] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 42.15 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:27:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:27:57] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28258/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559) (owner: 10Legoktm)
[19:29:38] <wikibugs>	 (03CR) 10Dzahn: "was this empty?" [puppet] - 10https://gerrit.wikimedia.org/r/666703 (https://phabricator.wikimedia.org/T275559) (owner: 10Legoktm)
[19:30:00] <wikibugs>	 (03PS2) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673)
[19:30:09] <mutante>	 wth.. it looked empty to me, but it's my bad connection
[19:30:36] <legoktm>	 o.O
[19:30:44] <legoktm>	 like it didn't show up on Gerrit properly?
[19:30:44] <wikibugs>	 (03Merged) 10jenkins-bot: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666704 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan)
[19:31:13] <mutante>	 like it showed the normal Gerrit UI but had not loaded the actual Files / diff section
[19:31:21] <legoktm>	 huh
[19:31:35] <mutante>	 like when you rebase something that has been done meanwhile and rebases into nothing
[19:31:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:33:02] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:26] * legoktm nods
[19:33:27] <tgr_>	 kostajh: it's on mwdebug1001
[19:33:36] <kostajh>	 tgr_: thanks, checking
[19:36:20] <wikibugs>	 10SRE, 10Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10BBlack) So, to expand a little bit on the text quoted at the top with some initial insights about cutoff vs nuke-limit tradeoffs and some of my current thinking and/or assumptions:  * Turn...
[19:37:06] <kostajh>	 tgr_: it doesn't break anything on test.wikipedia.org. The fix could only be verified on bnwiki in group2, which is still on wmf.31
[19:37:26] <tgr_>	 ok, deploying
[19:37:29] <kostajh>	 tgr_: we could merge this patch and backport to wmf.31 as well if you'd like
[19:40:37] <logmsgbot>	 !log tgr@deploy1001 Synchronized php-1.36.0-wmf.32/extensions/GrowthExperiments/: Backport: [[gerrit:666704|Impact module: Add "not rendered" state (T270294, T275615)]] (duration: 01m 26s)
[19:40:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:45] <stashbot>	 T270294: Scale: deploy without impact module - https://phabricator.wikimedia.org/T270294
[19:40:45] <stashbot>	 T275615: 'impact_module_state' is a required property - https://phabricator.wikimedia.org/T275615
[19:40:53] <tgr_>	 let's do .31 so we don't have to worry about train rollbacks
[19:41:02] <wikibugs>	 (03PS1) 10JMeybohm: linkrecommendation: Allow egress to analytics.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982
[19:41:29] <wikibugs>	 (03PS1) 10Gergő Tisza: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666986 (https://phabricator.wikimedia.org/T270294)
[19:41:40] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666986 (https://phabricator.wikimedia.org/T270294) (owner: 10Gergő Tisza)
[19:41:48] <wikibugs>	 (03CR) 10JMeybohm: "Not sure on this, should we maybe not go though the CDN here?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm)
[19:41:58] <wikibugs>	 (03PS3) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673)
[19:42:40] <wikibugs>	 (03CR) 10Kosta Harlan: "Thanks for this. How was the script able to pull the datasets without this patch?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm)
[19:42:56] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.219 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:44:09] <wikibugs>	 (03CR) 10JMeybohm: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm)
[19:44:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[19:51:07] <wikibugs>	 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): elastic1033's mgmt is unreachable - https://phabricator.wikimedia.org/T275733 (10Gehel)
[19:56:54] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper data-reload in progress https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:54] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS SPARQL on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.216 second response time Ryan Kemper data-reload in progress https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:57:07] <wikibugs>	 (03Merged) 10jenkins-bot: Impact module: Add "not rendered" state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666986 (https://phabricator.wikimedia.org/T270294) (owner: 10Gergő Tisza)
[19:58:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:00:05] <jouncebot>	 longma and marxarelli: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210225T2000).
[20:00:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:00:21] <marxarelli>	 o/
[20:00:55] <tgr_>	 marxarelli: backport is running over, I'll be done in a couple minutes
[20:01:28] <tgr_>	 ...except there are a bunch of undeployed commits in wmf.31
[20:01:50] <tgr_>	 they are all to .pipeline/blubber.yaml, I suppose that can be ignored?
[20:03:07] <tgr_>	 kostajh: can you check on group2/mwdebug1001?
[20:04:21] <kostajh>	 tgr_: yes, just a few minutes please
[20:06:38] <icinga-wm>	 PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[20:08:57] <marxarelli>	 tgr_: no problem. longma is doing the deploy. i'm on backup. those were my undeployed commits actually, but they were reverted yesterday
[20:09:24] <kostajh>	 tgr_: hmm, not seeing the HomepageVisit event on bnwiki
[20:09:34] <kostajh>	 It's mwdebug1001 for sure?
[20:09:35] <marxarelli>	 either way, they shouldn't have an effect
[20:09:53] <tgr_>	 let me double-check
[20:10:22] <wikibugs>	 (03PS1) 10Legoktm: docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582)
[20:11:43] <tgr_>	 kostajh: seems to be there
[20:12:55] <wikibugs>	 (03PS1) 10Legoktm: docker: Add prod-build password for deneb [labs/private] - 10https://gerrit.wikimedia.org/r/666985
[20:13:10] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] docker: Add prod-build password for deneb [labs/private] - 10https://gerrit.wikimedia.org/r/666985 (owner: 10Legoktm)
[20:13:52] <wikibugs>	 (03PS2) 10Legoktm: docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582)
[20:13:54] <kostajh>	 tgr_: I'm still seeing the validation error
[20:14:52] <tgr_>	 oh well. the homepage is working so the patch didn't make anything worse, right?
[20:15:13] <kostajh>	 tgr_: right
[20:15:30] <tgr_>	 let's deploy it and get out of the way of the train. Maybe there's a job involved somehow and it will work when properly deployed.
[20:15:38] <tgr_>	 if not, we can figure it out next week
[20:15:56] <icinga-wm>	 RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[20:16:09] <kostajh>	 tgr_: sounds fine
[20:16:59] <wikibugs>	 (03PS1) 10Jdlrobson: Enable og tags on non-wikidata wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667007 (https://phabricator.wikimedia.org/T157145)
[20:17:05] <logmsgbot>	 !log tgr@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/GrowthExperiments/: Backport: [[gerrit:666704|Impact module: Add "not rendered" state (T270294, T275615)]] (duration: 01m 08s)
[20:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:15] <stashbot>	 T270294: Scale: deploy without impact module - https://phabricator.wikimedia.org/T270294
[20:17:15] <stashbot>	 T275615: 'impact_module_state' is a required property - https://phabricator.wikimedia.org/T275615
[20:17:21] <wikibugs>	 (03PS1) 10Dzahn: package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673)
[20:17:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[20:18:33] <tgr_>	 marxarelli, longma: all yours
[20:18:46] <longma>	 thanks tgr_ 
[20:19:25] <wikibugs>	 (03PS3) 10Legoktm: docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582)
[20:19:38] <wikibugs>	 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10Jdlrobson) @colewhite is there somebody I could pair with to write this...
[20:20:28] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28260/console" [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) (owner: 10Legoktm)
[20:23:00] <wikibugs>	 (03PS4) 10Dzahn: phabricator::tools: replace cron jobs with timers [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673)
[20:24:01] <wikibugs>	 (03PS2) 10Gergő Tisza: Add GrowthExperiments maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408)
[20:29:06] <wikibugs>	 (03PS2) 10Dzahn: package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673)
[20:29:08] <wikibugs>	 (03PS1) 10Jeena Huneidi: all wikis to 1.36.0-wmf.32  refs T274936 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667009
[20:29:10] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.36.0-wmf.32  refs T274936 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667009 (owner: 10Jeena Huneidi)
[20:29:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[20:30:09] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.32  refs T274936 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667009 (owner: 10Jeena Huneidi)
[20:44:10] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] typos: remove 'bullsey' [puppet] - 10https://gerrit.wikimedia.org/r/667012 (owner: 10Dzahn)
[20:47:06] <wikibugs>	 (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[20:48:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thank you:)" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:51:41] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/28262/" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:52:52] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "even though compiler said noop.. it is not actually noop on cloudweb2001-dev :/" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:53:14] <wikibugs>	 (03PS1) 10Dzahn: Revert "ldap::config::labs: replace hiera_hash with lookup" [puppet] - 10https://gerrit.wikimedia.org/r/666989
[20:54:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "ldap::config::labs: replace hiera_hash with lookup" [puppet] - 10https://gerrit.wikimedia.org/r/666989 (owner: 10Dzahn)
[20:54:46] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "ldap::config::labs: replace hiera_hash with lookup" [puppet] - 10https://gerrit.wikimedia.org/r/666989 (owner: 10Dzahn)
[20:57:59] <wikibugs>	 (03PS1) 10JMeybohm: event*: Enable egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015
[21:00:55] <wikibugs>	 (03CR) 10Dzahn: "I don't know what is going on, but on cloudweb2001-dev there is a puppet change on every run that removes a LVS service IP. (which was not" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:06:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] docker::engine: replace hiera_hash with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:08:36] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:59] <wikibugs>	 (03CR) 10Dzahn: "noop confirmed: deneb, kubernetes2001, kubestage1002, releases1001" [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:09:37] <wikibugs>	 (03PS2) 10Dzahn: deployment::rsync: remove references to the 'trebuchet' name [puppet] - 10https://gerrit.wikimedia.org/r/666757
[21:09:58] <wikibugs>	 (03CR) 10Mforns: "Code makes sense to me, but I'm completely unaware of the implications of this change... @elukey?" [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight)
[21:15:18] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28264/" [puppet] - 10https://gerrit.wikimedia.org/r/666757 (owner: 10Dzahn)
[21:19:35] <wikibugs>	 (03CR) 10Dzahn: "confirmed rsync command still working on deploy2001,ran manually" [puppet] - 10https://gerrit.wikimedia.org/r/666757 (owner: 10Dzahn)
[21:20:12] <mutante>	 !log deploy2001 - rsynced /srv/deployment from deploy1001 after gerrit:666757
[21:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:34] <icinga-wm>	 PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[21:20:40] <wikibugs>	 10SRE, 10SRE-swift-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10wiki_willy)
[21:21:31] <wikibugs>	 10SRE, 10User-jijiki: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (10wiki_willy)
[21:24:40] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Papaul) 05Open→03Resolved
[21:25:02] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Papaul) 05Open→03Resolved
[21:25:30] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T268622 (10Papaul) 05Open→03Resolved
[21:32:22] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker: Use "prod-build" account for pushing production images [puppet] - 10https://gerrit.wikimedia.org/r/666984 (https://phabricator.wikimedia.org/T275582) (owner: 10Legoktm)
[21:34:02] <icinga-wm>	 RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[21:36:46] <wikibugs>	 (03PS1) 10Legoktm: docker: Lockdown /root/.docker a bit more [puppet] - 10https://gerrit.wikimedia.org/r/667017
[21:38:13] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28266/console" [puppet] - 10https://gerrit.wikimedia.org/r/667017 (owner: 10Legoktm)
[21:38:59] <legoktm>	 !log pushed new version of docker-registry.discovery.wmnet/wikimedia-buster image
[21:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:11] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker: Lockdown /root/.docker a bit more [puppet] - 10https://gerrit.wikimedia.org/r/667017 (owner: 10Legoktm)
[21:39:24] <wikibugs>	 (03PS1) 10Dzahn: role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018
[21:39:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018 (owner: 10Dzahn)
[21:41:41] <wikibugs>	 (03PS2) 10Dzahn: role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018
[21:45:24] <wikibugs>	 (03PS3) 10Dzahn: role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018
[21:50:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:50:53] <wikibugs>	 (03PS1) 10Odder: Add localised logos for the Altay Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667021 (https://phabricator.wikimedia.org/T275819)
[21:52:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:54:26] <wikibugs>	 (03Abandoned) 10Odder: Add localised logos for the Altay Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667021 (https://phabricator.wikimedia.org/T275819) (owner: 10Odder)
[21:54:31] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Make Toolforge docker registry cert configurable [puppet] - 10https://gerrit.wikimedia.org/r/666915 (https://phabricator.wikimedia.org/T267701) (owner: 10Majavah)
[21:55:48] <wikibugs>	 (03CR) 10Legoktm: "Hi Odder! The process is described in logos/README, let me know if you need any help/assistance with it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667021 (https://phabricator.wikimedia.org/T275819) (owner: 10Odder)
[22:01:26] <icinga-wm>	 PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[22:03:22] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10Papaul) @klausman any update on this? IF the install is done can you please resolve the task? Thanks.
[22:05:12] <wikibugs>	 (03CR) 10Ottomata: "What does it do!?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015 (owner: 10JMeybohm)
[22:08:44] <icinga-wm>	 RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[22:14:22] <mutante>	 so, I have 4 deployment servers with /srv/mediawiki on them that we synced at some point. but the sizes are: 12G, 16G, 18G and 20G.  yay
[22:14:53] <mutante>	 with /srv/deployment there is no such issue, they are all 43G equally because they pull from 1001 with --delete
[22:15:47] <mutante>	 now looking what to do with the /srv/mediawiki part before switching to 1002 in a couple days
[22:17:09] <wikibugs>	 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10colewhite) @Jdlrobson I'd be happy to help.  Ping me on IRC or elsewher...
[22:25:11] <mutante>	 why is "mediawiki-staging" over 200GB in codfw but only 20GB in eqiad? Is anyone confident about what can be deleted, if anything?
[22:26:00] <legoktm>	 o.O
[22:26:16] <legoktm>	 mutante: on which server? (the 200GB)
[22:26:32] <mutante>	 legoktm: deploy2001.codfw.wmnet:/srv/mediawiki-staging
[22:28:36] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "Ok, new wiki creates are done, so I'll give this a whirl!" [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE))
[22:28:57] * legoktm looks
[22:29:24] <legoktm>	 oh
[22:29:32] <legoktm>	 someone just needs to prune the old branches
[22:31:46] <mutante>	 legoktm: thank you. so yea, I have 4 different copies of /srv/mediawiki too and they are all a bit different
[22:31:58] <legoktm>	 mutante: there's a `scap clean` command but I've never used it before. probably best to ask someone from releng to do it
[22:32:21] <tabbycat>	 or scap prune
[22:32:26] <tabbycat>	 I don't remember rightly
[22:32:54] <legoktm>	 I don't see prune in the output of `scap --help`
[22:33:28] <mutante>	 ack. interesting. thank you. or.. I just copy 'all the things' over to some archive dir on new hosts.. and it can be sorted out there
[22:35:46] <legoktm>	 if we're setting up new things I think we should just start with the content on deploy1001
[22:36:35] <legoktm>	 the fact that everything else is slightly different isn't great if we had to unexpectedly switch over but I suspect it's not functional differences
[22:37:01] <mutante>	 /srv/deployment is like that. they are all pulling from 1001 with --delete.. so that is identical
[22:37:11] <mutante>	 what is not is the other stuff in /srv
[22:37:36] <mutante>	 what if we expectedly had to switch over :p
[22:38:17] <mutante>	 should there be another rsync with --delete for /srv/mediawiki from 1001
[22:38:23] <mutante>	 to ..just the new servers
[22:38:43] <mutante>	 that I can do of course
[22:39:50] <wikibugs>	 (03CR) 10Bstorm: "Ok, I can confirm that (on my test instance) this flipped the bit for szywiki from 0 to 1." [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE))
[22:40:23] <legoktm>	 I think so
[22:40:27] <marxarelli>	 legoktm, mutante: that's odd. i `scap clean`'d all but two branches last week. i wonder if it's not executing correctly on the second deploy server
[22:41:13] <legoktm>	 hmm
[22:41:34] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 202 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:41:37] <legoktm>	 no, I think you're right, it only has cache directories left
[22:41:40] <mutante>	 marxarelli: the one where it's so large is the codfw one, the old one
[22:42:04] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:36] <legoktm>	 do we really need the 1.32 and 1.33 branches though?
[22:42:37] <mutante>	 so it's about which dc we are currently in 
[22:42:37] <wikibugs>	 (03CR) 10Bstorm: "Since this doesn't run automatically for all databases on schedule, this will only update when we run the script as well. If it ever becom" [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE))
[22:43:28] <legoktm>	 there's 116 branches on deploy2001, each is about 2G so there's our 200G+
[22:43:38] <marxarelli>	 yikes
[22:43:58] <marxarelli>	 no, we shouldn't need anything that isn't on deploy1001
[22:44:10] <marxarelli>	 i'll take a look
[22:44:19] <mutante>	 thank you :)
[22:44:26] <legoktm>	 :D
[22:47:34] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:49:06] <marxarelli>	 longma: do you mind if i `scap clean` the wmf.30 branch? i want to see if it targets deploy2001 correctly (see ^)
[22:49:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:49:59] <mutante>	 Urbanecm: oh man.. and about the other stuff i was talking about. patches, rsync.. all that.. I already fixed this stuff! but for releases*, not deploy*! but it's the exact same issue for another set of servers. that's why it felt all so familiar but still not just working! aha
[22:50:25] <Urbanecm>	 hehe
[22:50:30] <mutante>	 that path exists on releases* too
[22:51:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:54:29] <longma>	 marxarelli: sure, go ahead
[22:54:37] <marxarelli>	 k
[22:55:10] <longma>	 I was going to clean it up tomorrow since I always feel short on time on Tuesdays
[22:55:58] <marxarelli>	 makes sense. i just wanted to figure out why deploy2001 is full of old versions
[22:56:20] <longma>	 yeah that makes sense
[22:59:53] <wikibugs>	 (03PS1) 10Dzahn: deployment::rsync:: also sync patches directory [puppet] - 10https://gerrit.wikimedia.org/r/667031
[23:01:58] <logmsgbot>	 !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.30 (duration: 04m 20s)
[23:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:59] <wikibugs>	 (03CR) 10Dzahn: "This will make it so that any non-active deployment_server (deploy2001, deploy1002, deploy2002) will pull from deploy1001 with --delete to" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn)
[23:03:19] <wikibugs>	 (03PS1) 10Razzi: hadoop: Add new worker nodes to hadoop_clusters [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767)
[23:04:46] <wikibugs>	 (03CR) 10Dzahn: "You don't need to review the puppet code, just the idea that we pull from one source with --delete so there can only be one version of /srv" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn)
[23:07:05] <marxarelli>	 legoktm, mutante: seems `scap clean` is working properly but it doesn't target the /srv/mediawiki-staging dir on other deploy hosts. looks like ops/puppet/modules/scap/files/scap-master-sync is responsible for that
[23:08:07] <mutante>	 marxarelli: aha! thank you! and it has --delete in it
[23:08:32] <mutante>	 and it also syncs /srv/patches
[23:08:42] <marxarelli>	 yep yep
[23:08:50] <mutante>	 but i still need extra code in my change above to make puppet just automatically do this
[23:09:02] <mutante>	 just using the same way it already does for /srv/deployment
[23:09:14] <wikibugs>	 (03PS1) 10Mstyles: add new updater job properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095)
[23:09:29] <wikibugs>	 (03PS1) 10Jdlrobson: Do not log graph errors to WMF servers [extensions/Graph] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666999 (https://phabricator.wikimedia.org/T274557)
[23:09:29] <mutante>	 so we could run  /usr/local/bin/scap-master-sync  on 2001
[23:12:26] <marxarelli>	 seems reasonable to me, but i'm not super familiar with why all of those flags are used (`--delete-delay` and such)
[23:12:54] <marxarelli>	 and why all of those old php- directories are still around if it's specifying `--delete` already
[23:13:09] <mutante>	 !log deploy1002 - /usr/local/bin/scap-master-sync deploy1001.eqiad.wmnet
[23:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:48] <mutante>	 ^ " from the given deployment master to the local staging directory"
[23:14:11] <marxarelli>	 ah ok. and they weren't running before?
[23:14:30] <mutante>	 no, because /srv/patches was empty
[23:14:35] <mutante>	 that is what made me wonder about it
[23:14:51] <marxarelli>	 sorry, i'm not following your changes very closely. i just noticed you discussing the issue with old php- dirs and thought i'd help
[23:14:54] <marxarelli>	 got it
[23:15:47] <mutante>	 don't be sorry, it is very helpful
[23:16:00] <mutante>	 i am trying to sync between old and new eqiad server right now
[23:16:04] <marxarelli>	 :)
[23:16:20] <mutante>	 the thing here is that I am trying to manage it all with puppet
[23:16:27] <mutante>	 but scap also has commands for it that i did not know
[23:16:37] <mutante>	 but apparently nothing runs those automatically
[23:17:19] <mutante>	 so the main part, /srv/deployment was already puppetized but not this other part, staging and patches
[23:17:43] <mutante>	 and a big part of what I wanted review for was.. should all of this sync with --delete
[23:17:54] <mutante>	 now that I know existing scap scripts do that too.. i can safely assume it
[23:17:58] <wikibugs>	 (03PS2) 10Razzi: hadoop: Add new worker nodes to hadoop_clusters [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767)
[23:18:04] <mutante>	 and make it work automatically
[23:20:29] <mutante>	 yea, master sync worked. in that ..it pulled the patches. what it did NOT mean is that mediawiki-staging is now the same size on both
[23:21:53] <mutante>	 looking at --delete-delay
[23:23:05] <mutante>	 hmm, no, that does not explain it, just means deletions at the end of the transfer
[23:24:59] <mutante>	 ah,  --exclude="**/cache/l10n/*.cdb"    is it
[23:25:16] <marxarelli>	 mutante: yeah, looks like it excludes l10n
[23:25:21] <mutante>	 the .cdb files don't get copied over.. so can't expect them to be identical
[23:25:31] <marxarelli>	 which uses the most space by far :/
[23:25:45] <marxarelli>	 ~ 2G per mw version
[23:25:49] <mutante>	 guess I should forget about this idea that I need both sides to be identical before I switch
[23:26:10] <marxarelli>	 i don't see why we should exclude the l10n cache
[23:26:12] <mutante>	 well, it's about the right order to switch 
[23:27:30] <mutante>	 marxarelli: that means if we had to fail-over we first have to rebuild all the l10n cache and it takes ... a long time, right?
[23:27:57] <marxarelli>	 it happens as part of the normal deploy process during `scap sync-world`
[23:28:38] <marxarelli>	 i'll file a task and we can discuss in releng. maybe we can ditch the exclude
[23:29:37] <mutante>	 ok, thank you!
[23:30:56] <mutante>	 the bigger issue is how to switch the centrally defined server but also already be synced
[23:32:52] <mutante>	 !log deploy2001 - scap-master-sync from deploy1001
[23:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:36] <mutante>	 marxarelli: last comment for now. will follow-up on ticket. but ..it's trying to delete but "cannot delete non-empty directory" of old branches. that's the bug probably
[23:35:12] <marxarelli>	 ok. i think they're not empty because cache/l10n/*.cdb files are still there, right?
[23:35:30] <mutante>	 looks like it, they are all cache/l10n and then cache
[23:35:37] <mutante>	 yes
[23:35:40] <marxarelli>	 k
[23:39:29] <mutante>	 !log deploy2001 - scap-master-sync from deploy1001 runs and attempts to --delete files to stay in sync but fails to do so because *.cdb files are in cache dirs and rsync does not want to delete non-empty directories, this leads to build up of the size of /srv/mediawiki-staging to 10 times the size of eqiad
[23:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:29] <mutante>	 !log deploy2001 2/2 - because rsync is --delete but also --exclude="**/cache/l10n/*.cdb" --exclude="*.swp"  you can't expect /srv/mediawiki-staging to be the same size on 2 servers
[23:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
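Editor's note: the two log entries above describe why old branch directories pile up on deploy2001. The sync deletes what the master no longer has, but files matching an exclude pattern are left alone on the receiving side, so the directories that hold them cannot be removed. The following is a minimal sketch of that interaction, not the actual scap-master-sync script and not rsync itself. It is a toy model in Python: the directory names (`php-old`, `php-new`), the helper `sync_delete`, and the simplified fnmatch pattern `*/cache/l10n/*.cdb` (standing in for the `**/cache/l10n/*.cdb` exclude quoted in the log) are all assumptions made for illustration.

```python
import fnmatch
import os
import tempfile


def sync_delete(src, dst, excludes):
    """Toy model of a --delete pass: remove from dst whatever src lacks,
    but never touch files matching an exclude pattern.

    Returns the directories that should have gone away but could not be
    removed because protected files were still inside them.
    """
    stuck = []
    # Walk bottom-up so children are handled before their parents.
    for root, _dirs, files in os.walk(dst, topdown=False):
        rel_root = os.path.relpath(root, dst)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name)).replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, pat) for pat in excludes):
                continue  # excluded files are left alone on the receiver
            if not os.path.exists(os.path.join(src, rel)):
                os.remove(os.path.join(root, name))
        if rel_root != "." and not os.path.exists(os.path.join(src, rel_root)):
            try:
                os.rmdir(root)  # only succeeds on an empty directory
            except OSError:
                stuck.append(rel_root.replace(os.sep, "/"))
    return sorted(stuck)


def demo():
    """Build a hypothetical master/replica pair and run the toy sync."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "master")
        dst = os.path.join(tmp, "replica")
        layout = {
            src: ["php-new/index.php"],
            dst: [
                "php-new/index.php",
                "php-old/index.php",
                "php-old/cache/l10n/l10n_cache-en.cdb",
            ],
        }
        for base, paths in layout.items():
            for rel in paths:
                full = os.path.join(base, rel)
                os.makedirs(os.path.dirname(full), exist_ok=True)
                open(full, "w").close()
        return sync_delete(src, dst, ["*/cache/l10n/*.cdb"])


if __name__ == "__main__":
    for path in demo():
        print("cannot delete non-empty directory:", path)
```

In this model the stale `php-old` tree loses its ordinary files but keeps its `cache/l10n` skeleton, which matches the "cannot delete non-empty directory" symptom reported in the channel. rsync does have a `--delete-excluded` option that also removes excluded files on the receiving side; whether that or dropping the exclude is appropriate here is left to the follow-up task mentioned above.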
[23:55:13] <mutante>	 !log deploy1002, deploy2002 - scap-master-sync deploy1001.eqiad.wmnet (T265963)
[23:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:19] <stashbot>	 T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963
[23:59:40] <wikibugs>	 (03Abandoned) 10Dzahn: remove deploy1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/635114 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)