[00:37:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:38:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:33:10] <librenms-wmf>	 Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[01:55:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:57:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:07:10] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - https://gerrit.wikimedia.org/r/633598
[02:11:04] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - https://gerrit.wikimedia.org/r/633598 (https://phabricator.wikimedia.org/T263179) (owner: TrainBranchBot)
[03:23:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:28:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:36:04] <icinga-wm>	 PROBLEM - Disk space on ms-be2036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[04:05:16] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:07:44] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:08:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:11:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:13:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:18:52] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:33:10] <librenms-wmf>	 Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[05:11:13] <marostegui>	 Urbanecm: Sorry, I was out yesterday, will check the task where you pinged me
[05:35:22] <marostegui>	 !log Set global innodb_change_buffering = inserts; on pc2009 T263443
[05:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:30] <stashbot>	 T263443: Evaluate the impact of changing innodb_change_buffering to inserts  - https://phabricator.wikimedia.org/T263443
[05:41:53] <wikibugs>	 (CR) Muehlenhoff: "We can simply drop timidity and freepats from the Puppet manifest; installed packages will stick around and things will get cleaned up whe" [puppet] - https://gerrit.wikimedia.org/r/445604 (owner: Reedy)
[05:50:30] <wikibugs>	 (PS1) Marostegui: site.pp: Remove s3 comment [puppet] - https://gerrit.wikimedia.org/r/633603
[05:51:28] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove s3 comment [puppet] - https://gerrit.wikimedia.org/r/633603 (owner: Marostegui)
[05:55:46] <icinga-wm>	 PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:12] <wikibugs>	 (PS1) Elukey: Remove analytics1048 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633605 (https://phabricator.wikimedia.org/T255140)
[06:19:43] <wikibugs>	 (CR) Elukey: [C: +2] Remove analytics1048 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/633605 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[06:39:54] <moritzm>	 !log Installing httpcomponents-client security updates for Stretch
[06:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:42:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:02:56] <moritzm>	 !log installing PHP 7.0 security updates
[07:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:07] <wikibugs>	 (PS1) Ayounsi: Add cloud-in4 filters to cloudsw interfaces [homer/public] - https://gerrit.wikimedia.org/r/633683 (https://phabricator.wikimedia.org/T265288)
[07:06:53] <wikibugs>	 (CR) Ayounsi: [C: +2] Add cloud-in4 filters to cloudsw interfaces [homer/public] - https://gerrit.wikimedia.org/r/633683 (https://phabricator.wikimedia.org/T265288) (owner: Ayounsi)
[07:07:17] <wikibugs>	 (Merged) jenkins-bot: Add cloud-in4 filters to cloudsw interfaces [homer/public] - https://gerrit.wikimedia.org/r/633683 (https://phabricator.wikimedia.org/T265288) (owner: Ayounsi)
[07:15:01] <wikibugs>	 (CR) Volans: "Nice! Thanks for working on this. Replies inline about naming and a couple of nits." (4 comments) [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:23:41] <wikibugs>	 (CR) Elukey: [C: +1] documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[07:24:41] <wikibugs>	 (CR) Volans: [C: +2] documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[07:24:51] <wikibugs>	 (CR) Volans: [C: +2] pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478 (owner: Volans)
[07:27:55] <wikibugs>	 (Merged) jenkins-bot: documentation: refactor configuration [software/spicerack] - https://gerrit.wikimedia.org/r/633477 (owner: Volans)
[07:28:33] <wikibugs>	 (Merged) jenkins-bot: pylint: allow 'logger' as module-scope name [software/spicerack] - https://gerrit.wikimedia.org/r/633478 (owner: Volans)
[07:33:10] <librenms-wmf>	 Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[07:33:25] <wikibugs>	 Operations, DBA, Data-Persistence, Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (hashar) From yesterday's discussion:  * we do not run arbitrary images * the CI job usually just runs an image using Docker, the i...
[07:35:41] <godog>	 jayme: thanks for the fix for ms-be2036 !
[07:36:35] <wikibugs>	 Operations, Data-Persistence-Backup, SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (Marostegui)
[07:37:24] <jayme>	 godog: sure, yw!
[07:37:37] <moritzm>	 !log installing ruby security updates on stretch
[07:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:14] <jayme>	 godog: I did not look into what has gone wrong, though. root's bash_history suggests there were other directories with broken permissions as well...do you have a theory?
[07:40:36] <wikibugs>	 Operations, MW-on-K8s, serviceops: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe)
[07:41:54] <icinga-wm>	 PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 79 probes of 651 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:42:03] <godog>	 jayme: sadly not a good one no, the host was in trouble over the weekend and the hw controller I think freaked out, with one of the ssd in sw raid timing out
[07:43:20] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[07:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:24] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:39] <kormat>	 !log running schema change against s3 in eqiad T259831
[07:43:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:44] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[07:49:22] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marostegui T265323 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:53] <wikibugs>	 Operations, MW-on-K8s, serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (Joe) p:Triage→High
[07:52:24] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 70 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:55:19] <godog>	 mmhh looks like the ms-be2036 fs full is back, I'll take a look
[07:55:38] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[07:55:39] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:31] <wikibugs>	 Operations, Machine Learning Platform, SRE-Access-Requests: Requesting adding to ores-admin for Ladsgroup - https://phabricator.wikimedia.org/T265172 (Lydia_Pintscher) It'd be <3 if we could get this approved so that we can get T261326 done in time for Wikidata's 8th birthday on the 29th.
[07:57:10] <wikibugs>	 (CR) Kormat: [C: +2] (Mostly) convert to pytest. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633534 (owner: Kormat)
[08:00:02] <wikibugs>	 (Merged) jenkins-bot: (Mostly) convert to pytest. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633534 (owner: Kormat)
[08:04:26] <icinga-wm>	 RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[08:11:17] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[08:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:45] <wikibugs>	 Operations, Analytics-Clusters, Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema) The function `VUT_Main` is the main loop of VUT programs. The [[  https://github.com/varnishcache/varnish-cache/blob/6d4df3639725bbec6d1657b07867ec44f4ba14f8/lib/libvarnish...
[08:13:15] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:02] <wikibugs>	 (PS1) Kormat: tox: Output format diffs [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633691
[08:18:26] <wikibugs>	 (CR) Kormat: [C: +2] tox: Output format diffs [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633691 (owner: Kormat)
[08:19:53] <wikibugs>	 (Merged) jenkins-bot: tox: Output format diffs [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633691 (owner: Kormat)
[08:21:41] <wikibugs>	 Operations, netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (ayounsi) Steps to go directly with the most sustainable option. No need for a site depool if done carefully.  {F32383060}  [] Configure cr2/3:ae0 with `link-speed mixed` [] Disconnect cable #20042 between...
[08:24:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:25:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:29:02] <wikibugs>	 Operations, DBA, Data-Persistence, Blocked-on-schema-change, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (Kormat)
[08:40:10] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:40:11] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:08] <wikibugs>	 Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (Trizek-WMF)
[08:56:04] <wikibugs>	 (PS1) Elukey: Update tests and docs to Varnish 6 and Debian Buster [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/633695
[08:59:11] <wikibugs>	 (PS2) Elukey: Update tests and docs to Varnish 6 and Debian Buster [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/633695
[09:01:12] <wikibugs>	 (PS1) Elukey: Set VUT grouping parameter to 'request' by default [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/633696 (https://phabricator.wikimedia.org/T264074)
[09:02:01] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Update tests and docs to Varnish 6 and Debian Buster [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/633695 (owner: Elukey)
[09:05:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:07:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:15:00] <wikibugs>	 (CR) Ema: [C: +1] Set VUT grouping parameter to 'request' by default [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/633696 (https://phabricator.wikimedia.org/T264074) (owner: Elukey)
[09:20:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:22:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:22:25] <wikibugs>	 Operations, Traffic, observability, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (Ladsgroup)
[09:23:41] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v0.0.44 [software/spicerack] - https://gerrit.wikimedia.org/r/633697
[09:26:56] <wikibugs>	 Operations, Traffic, observability, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (Ladsgroup)
[09:27:49] <wikibugs>	 Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (Elitre) @JMinor @JKatzWMF please give your Go to get this done. Thanks!
[09:27:54] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.44 [software/spicerack] - https://gerrit.wikimedia.org/r/633697 (owner: Volans)
[09:29:29] <wikibugs>	 (CR) Jbond: [C: +1] Add an apt proxy config for deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/633172 (https://phabricator.wikimedia.org/T262647) (owner: Muehlenhoff)
[09:29:49] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 38 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:30:37] <icinga-wm>	 RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 9 probes of 651 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:30:55] <wikibugs>	 (CR) Jbond: [C: +2] admin: urbanecm's home: Update .gitconfig [puppet] - https://gerrit.wikimedia.org/r/633365 (owner: Urbanecm)
[09:30:57] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.44 [software/spicerack] - https://gerrit.wikimedia.org/r/633697 (owner: Volans)
[09:31:16] <wikibugs>	 (CR) Jbond: "merged" [puppet] - https://gerrit.wikimedia.org/r/633365 (owner: Urbanecm)
[09:32:39] <ema>	 !log cp3050: set grouping by request (vut->g_arg = 2) on varnishkafka-webrequest T264074
[09:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:46] <stashbot>	 T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[09:36:07] <wikibugs>	 (PS12) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[09:37:48] <wikibugs>	 (PS1) Volans: Upstream release v0.0.44 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/633698
[09:38:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[09:38:47] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:55] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:38:58] <kormat>	 !log running schema change against s1 in eqiad T259831
[09:38:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:05] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:39:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] Upstream release v0.0.44 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/633698 (owner: Volans)
[09:40:15] <wikibugs>	 (PS13) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[09:42:53] <wikibugs>	 Operations, Traffic, observability, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (Ladsgroup) After talking to @ema, I'm cleaning this up a bit. Delet...
[09:46:08] <wikibugs>	 (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[09:46:49] <wikibugs>	 (CR) Volans: "recheck" [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/633698 (owner: Volans)
[09:49:38] <wikibugs>	 Operations, serviceops, Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (JMeybohm) p:Triage→Medium
[09:50:42] <wikibugs>	 (CR) Volans: [C: +2] Upstream release v0.0.44 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/633698 (owner: Volans)
[09:51:09] <wikibugs>	 Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (JMeybohm) p:Triage→Medium
[09:51:36] <ema>	 !log cp3052: systemctl restart varnishkafka-webrequest.service T264074
[09:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:43] <stashbot>	 T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[09:51:49] <wikibugs>	 (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Ifa63ea16a07d8a39f71676e1300f38b6492afddb [mediawiki-config] - https://gerrit.wikimedia.org/r/633702
[09:51:51] <wikibugs>	 (PS1) Evrifaessa: Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336)
[09:52:01] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (JMeybohm) p:Triage→Medium
[09:52:50] <wikibugs>	 (Abandoned) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Ifa63ea16a07d8a39f71676e1300f38b6492afddb [mediawiki-config] - https://gerrit.wikimedia.org/r/633702 (owner: Evrifaessa)
[09:53:07] <wikibugs>	 (Merged) jenkins-bot: Upstream release v0.0.44 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/633698 (owner: Volans)
[09:53:43] <wikibugs>	 Operations, MW-on-K8s, serviceops: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (JMeybohm) p:Triage→Medium
[09:55:24] <wikibugs>	 Operations, CAS-SSO, User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (JMeybohm) p:Triage→Medium
[09:55:30] <ema>	 !log cp3054: systemctl restart varnishkafka-webrequest.service T264074
[09:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:32] <wikibugs>	 Operations, homer, netops: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (ayounsi)
[10:02:22] <wikibugs>	 (CR) Gehel: "A few minor comments inline. I've only done a high level review, someone should review in more details." (8 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[10:04:33] <volans>	 !log uploaded spicerack_0.0.44 to apt.wikimedia.org buster-wikimedia
[10:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:43] <wikibugs>	 Operations, Wikidata, serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (JMeybohm) p:Triage→Medium
[10:07:57] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:08:10] <wikibugs>	 Operations, DBA, Data-Persistence, Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (JMeybohm) p:Triage→Medium
[10:09:22] <ema>	 !log cp3050: *reload* varnishkafka-webrequest T264074
[10:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:27] <stashbot>	 T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[10:09:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:24] <ema>	 !log cp3050: clear varnishkafka-webrequest's vut->sighup via stap T264074
[10:19:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:29] <wikibugs>	 (PS1) Filippo Giunchedi: install_server: use standard partman recipe for nvme cp hosts [puppet] - https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955)
[10:19:29] <stashbot>	 T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074
[10:29:45] <volans>	 !log upgrading spicerack on cumin2001 to 0.0.44
[10:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:10] <librenms-wmf>	 Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[10:37:55] <wikibugs>	 Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (Trizek-WMF)
[10:40:11] <wikibugs>	 Operations, Traffic, observability, Patch-For-Review, User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (fgiunchedi) >>! In T263536#6522137, @fgiunchedi wrote: > With 50 percentile added I'm considering this closed! >  > A...
[10:41:03] <wikibugs>	 Operations, Traffic, observability, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (fgiunchedi) I wanted to mention/add that as part of {T263536} there...
[10:43:47] <wikibugs>	 (PS1) Volans: doc: add missing link to wmflib package [software/spicerack] - https://gerrit.wikimedia.org/r/633709
[10:46:14] <wikibugs>	 (CR) Elukey: [C: +1] doc: add missing link to wmflib package [software/spicerack] - https://gerrit.wikimedia.org/r/633709 (owner: Volans)
[10:46:39] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:47:26] <jayme>	 !log no-change rolling restart of push-notifications in codfw - T265258
[10:47:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:33] <stashbot>	 T265258: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258
[10:47:41] <wikibugs>	 (PS2) Volans: sre.hosts.decommission: import from new library [cookbooks] - https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905)
[10:48:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.hosts.decommission: import from new library [cookbooks] - https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: Volans)
[10:50:32] <wikibugs>	 Operations, Analytics-Clusters, Traffic, Patch-For-Review: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema) >>! In T264074#6537906, @ema wrote: > So varnishkafka seems to be correctly looping continuously in the do-while part of VUT_Main. Why is VSM_Status b...
[10:51:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:54:33] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[10:55:03] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:57:15] <wikibugs>	 (03PS3) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905)
[10:58:29] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:59:42] <wikibugs>	 10Operations, 10DBA: Puppetize grants for mysql hosts that are the source of recovery (dbstore, passive misc) - https://phabricator.wikimedia.org/T111929 (10LSobanski) @jcrespo could you weigh in on what is the exact work that needs to happen here?
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1100).
[11:00:04] <jouncebot>	 Zoranzoki21 and Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:13] <Urbanecm>	 I can deploy today!
[11:00:36] <Urbanecm>	 Evrifaessa: hello, are you around?
[11:01:33] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh routing configuration [puppet] - 10https://gerrit.wikimedia.org/r/633711 (https://phabricator.wikimedia.org/T261724)
[11:01:46] <Evrifaessa>	 Urbanecm: I'm here
[11:01:48] <Evrifaessa>	 o/
[11:01:59] <Urbanecm>	 cool
[11:02:12] <wikibugs>	 (03PS2) 10Urbanecm: Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) (owner: 10Evrifaessa)
[11:02:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) (owner: 10Evrifaessa)
[11:03:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) (owner: 10Evrifaessa)
[11:03:19] <Urbanecm>	 Evrifaessa: can you test at mwdebug2001, please?
[11:03:52] <Evrifaessa>	 w8
[11:04:10] <wikibugs>	 10Operations, 10DBA, 10Growth-Team: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154 (10LSobanski) 05Open→03Declined We consider this task to be very risky and with the limited gain suggest against g...
[11:04:10] <Urbanecm>	 w8?
[11:04:14] <Evrifaessa>	 wait
[11:04:28] <Evrifaessa>	 it works, but we need to move the pages I guess
[11:04:42] <Urbanecm>	 I'll do that with a script
[11:04:46] <Urbanecm>	 so, that's fine
[11:05:22] <Urbanecm>	 godog: sorry, I just devoiced you, as you're not a bot actually :-)
[11:05:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh routing configuration [puppet] - 10https://gerrit.wikimedia.org/r/633711 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez)
[11:05:32] <Kizule>	 Hello, sorry for being late. :)
[11:06:01] <kormat>	 Urbanecm: how can you be sure? :)
[11:06:23] <wikibugs>	 (03CR) 10Volans: [C: 03+2] doc: add missing link to wmflib package [software/spicerack] - 10https://gerrit.wikimedia.org/r/633709 (owner: 10Volans)
[11:06:25] <Urbanecm>	 kormat: I think I talked with the real godog a few times, but maybe it was an impostor? :-)
[11:06:42] <Urbanecm>	 Evrifaessa: if that's all, I think we can go ahead?
[11:06:43] <Urbanecm>	 Hi Kizule 
[11:07:18] <Kizule>	 Hello Urbanecm :)
[11:07:44] <Kizule>	 I have only https://gerrit.wikimedia.org/r/c/633250/ :)
[11:07:56] <Urbanecm>	 Kizule: I know, but I started with Evrifaessa, since you were not around
[11:08:00] <Urbanecm>	 you're in the queue :)
[11:09:04] <Kizule>	 Okay, my postman came a little while ago, which is why I didn't join on time.
[11:09:14] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e61fcebe7315f73d1fb4d531da37d2c1253115ee: Add namespace aliases for Turkish Wikipedia (T265336) (duration: 00m 59s)
[11:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:20] <stashbot>	 T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336
[11:09:53] <wikibugs>	 (03Merged) 10jenkins-bot: doc: add missing link to wmflib package [software/spicerack] - 10https://gerrit.wikimedia.org/r/633709 (owner: 10Volans)
[11:10:42] <Urbanecm>	 Evrifaessa: script started, patch deployed
[11:10:50] <wikibugs>	 (03PS3) 10Urbanecm: Add suppressredirect right to reviewers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633250 (https://phabricator.wikimedia.org/T265169) (owner: 10Zoranzoki21)
[11:10:54] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add suppressredirect right to reviewers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633250 (https://phabricator.wikimedia.org/T265169) (owner: 10Zoranzoki21)
[11:11:00] <Evrifaessa>	 ty
[11:11:06] <Urbanecm>	 np
[11:11:37] <wikibugs>	 (03Merged) 10jenkins-bot: Add suppressredirect right to reviewers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633250 (https://phabricator.wikimedia.org/T265169) (owner: 10Zoranzoki21)
[11:11:59] <Urbanecm>	 Kizule: can you try at mwdebug2001, please?
[11:12:06] <Evrifaessa>	 Urbanecm: how much time is it going to take for the script to be finished?
[11:12:36] <Urbanecm>	 several minutes - I've actually started just dryrun now, and will start the full run once the dry one completes
[11:13:06] <Urbanecm>	 I'll log here the start and en
[11:13:06] <wikibugs>	 (03CR) 10Gehel: Introduce an interface for progress bars. (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[11:13:07] <Urbanecm>	 *end
[11:13:24] <volans>	 !log installed spicerack_0.0.43-1+deb10u1_amd64.deb on cumin2001, need to wait for a long-running cookbook to end to upgrade both hosts
[11:13:24] <Kizule>	 Urbanecm: Okay
[11:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:01] <wikibugs>	 (03CR) 10Gehel: Introduce an interface for progress bars. (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[11:14:15] <Urbanecm>	 !log Start of `urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=trwiki --fix # T265336`
[11:14:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:21] <stashbot>	 T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336
[11:15:24] <Urbanecm>	 Kizule: how is it going?
[11:15:28] <Kizule>	 Urbanecm: My patch should be good, on Special:UserGroupRights I see suppressredirect added in "reviewers", as should be.
[11:15:32] <Urbanecm>	 cool
[11:15:58] <Urbanecm>	 syncing
[11:16:05] <Kizule>	 Urbanecm: Patch is good to go, yea. :)
[11:16:55] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 90028b4c3c1cd4407e0834d603ccb8b256f2498e: Add suppressredirect right to reviewers on bnwiki (T265169) (duration: 00m 58s)
[11:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:01] <stashbot>	 T265169: Add suppressredirect right to reviewers on bnwiki - https://phabricator.wikimedia.org/T265169
[11:17:02] <Urbanecm>	 Kizule: should be live :)
[11:17:03] <Evrifaessa>	 Urbanecm: there seem to be conflicts
[11:17:12] <Kizule>	 Urbanecm: Checking...
[11:17:21] <Urbanecm>	 Evrifaessa: the script will report that, and I'll say so on the task :-)
[11:17:22] <wikibugs>	 (03CR) 10Volans: [C: 04-2] "Need to wait a long-running cookbook on cumin1001 to finish before installing spicerack 0.0.44 and deploying this" [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[11:17:33] <Urbanecm>	 (still running through)
[11:18:02] <Kizule>	 Urbanecm: It is live, I'll close task as resolved now.
[11:18:07] <Urbanecm>	 cool
[11:18:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:22:57] <moritzm>	 !log imported php-defaults, php-excimer, php-luasandbox, php-geoip to component/icu63 T264991
[11:23:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:03] <stashbot>	 T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[11:24:42] <Evrifaessa>	 Urbanecm: did the script end yet?
[11:24:46] <Urbanecm>	 no
[11:24:46] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thanks for the fixes, LGTM, just few more nits that came to mind, if you are in for it, all totally optional." (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[11:24:51] <Urbanecm>	 still running
[11:25:00] <wikibugs>	 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Growth-Team (Current Sprint): Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh)
[11:31:19] <Evrifaessa>	 Urbanecm: finished yet?
[11:31:33] <Urbanecm>	 Evrifaessa: no :-). I'll !log once it's completed :)
[11:52:30] <wikibugs>	 (03PS1) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341)
[11:53:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: 10Ayounsi)
[11:59:15] <wikibugs>	 (03CR) 10Gehel: Introduce an interface for progress bars. (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[11:59:32] <wikibugs>	 (03CR) 10Gehel: Introduce an interface for progress bars. (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[12:00:00] <wikibugs>	 (03PS14) 10Gehel: Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[12:01:42] <Evrifaessa>	 Urbanecm: finished yet?
[12:04:39] <wikibugs>	 (03PS9) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212)
[12:05:00] <wikibugs>	 (03CR) 10Volans: "Replies inline" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[12:07:05] <wikibugs>	 (03PS2) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341)
[12:08:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: 10Ayounsi)
[12:08:15] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM!" [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[12:08:31] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[12:14:22] <godog>	 Urbanecm: haha! no worries, I don't know how I got voiced tbh
[12:14:33] <Evrifaessa>	 @seen Urbanecm
[12:14:33] <wm-bot>	 Evrifaessa: Urbanecm is in here, right now
[12:14:44] <godog>	 but yes definitely not a bot, that's what a bot would say
[12:20:55] <moritzm>	 !log imported dh-php, php-acpu, php-imagick to component/icu63 T264991
[12:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:03] <stashbot>	 T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[12:27:31] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10fgiunchedi) I spotted a problem with `sdd` in dmesg too, perhaps that disk isn't healthy  ` [100075.068371] sd 0:1:0:3: [sdd] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [100075.068375] sd 0:1...
[12:30:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] redis::instance: raise the file limit to match maxclients [puppet] - 10https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910) (owner: 10Giuseppe Lavagetto)
[12:42:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: redis::instance: switch to use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/632661
[12:42:34] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) > no sure why but some host take a looong time While debugging this i noticed that we where receiving a lot of messages like the following (avalible...
[12:42:50] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Papaul) p:05Triage→03Medium
[12:43:03] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front...
[12:43:13] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) p:05Triage→03Medium
[12:43:21] <Evrifaessa>	 Urbanecm: you here?
[12:43:37] <icinga-wm>	 PROBLEM - Disk space on ms-be2036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[12:44:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] redis::instance: switch to use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/632661 (owner: 10Giuseppe Lavagetto)
[12:44:07] <godog>	 yeah that's me ^ (ms-be2036)
[12:44:27] <wikibugs>	 (03Abandoned) 10JMeybohm: api-gateway: use default envoy 1.15.1 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/632483 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm)
[12:46:00] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) @Marostegui since the server is under warranty, it is best to use a disk that is under warranty as well.
[12:46:50] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) @Papaul sounds good, so maybe let's remove the old disk, give it 5 minutes, and then place the new one in?
[12:48:52] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) Will do that once on site
[12:49:27] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) Going to depool the host just in case, thanks!
[12:49:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2026 for on-site maintenance T263837 ', diff saved to https://phabricator.wikimedia.org/P12975 and previous config saved to /var/cache/conftool/dbconfig/20201013-124940-marostegui.json
[12:49:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc_instance: correct reference to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/633735
[12:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:51] <stashbot>	 T263837: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837
[12:49:59] <Urbanecm>	 !log End of `urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=trwiki --fix` # T265336
[12:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:04] <stashbot>	 T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336
[12:50:44] <Urbanecm>	 !log urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=trwiki --add-prefix=FIXME --fix # T265336
[12:50:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[12:51:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::redis::multidc_instance: correct reference to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/633735 (owner: 10Giuseppe Lavagetto)
[12:55:18] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10elukey) >>! In T264472#6525167, @Kormat wrote:  > I've created your kerberos principal earlier today, you should receive an email telling you how to...
[12:57:41] <wikibugs>	 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat)
[12:58:16] <wikibugs>	 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat)
[12:58:21] <icinga-wm>	 PROBLEM - Check size of conntrack table on cescout1001 is CRITICAL: CRITICAL: nf_conntrack is 99 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[12:58:27] <wikibugs>	 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Open→03Stalled All of eqiad is now done. The remaining hosts/sections in codfw will be done after the dc swit...
[12:59:32] <jbond42>	 ^^^ conntrack error is me 
[12:59:43] <sukhe>	 oh great, thanks, was about to look
[13:01:11] <jbond42>	 sukhe: ok, I have finished playing around on cescout; my nmap scan just finished, so that alert should clear soon. I'll remove nmap and let you know if I need it again. thanks :)
[13:01:45] <icinga-wm>	 RECOVERY - Check size of conntrack table on cescout1001 is OK: OK: nf_conntrack is 76 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[13:02:20] <wikibugs>	 (03CR) 10Mforns: "LGTM! Good to merge on our side!" [puppet] - 10https://gerrit.wikimedia.org/r/633510 (https://phabricator.wikimedia.org/T254332) (owner: 10Ayounsi)
[13:03:20] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: redis::instance: raise the file limit to match maxclients [puppet] - 10https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910)
[13:04:15] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi  > Let me know when it's fine to merge the relevant change (for src_net + dst_net at least). Please, merge...
[13:05:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] redis::instance: raise the file limit to match maxclients [puppet] - 10https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910) (owner: 10Giuseppe Lavagetto)
[13:05:57] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:07:20] <marostegui>	 there's a huge spike on fatals
[13:07:26] <sukhe>	 jbond42: np :)
[13:07:58] <wikibugs>	 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat)
[13:08:08] <moritzm>	 !log imported php-mailparse, php-mongodb, php-msgpack to component/icu63 T264991
[13:08:09] <wikibugs>	 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Stalled→03Resolved
[13:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:14] <stashbot>	 T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[13:08:38] <wikibugs>	 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Resolved→03Stalled
[13:08:41] <wikibugs>	 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat)
[13:09:21] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
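(Editor's note: the alert texts above, "102 gt 100" for CRITICAL and "(C)100 gt (W)50 gt 14" for OK, read like a two-threshold comparison. The sketch below is only an illustration of that reading. The function name `classify`, the strict "greater than" semantics, and the default thresholds of 50 and 100 are assumptions taken from the message text, not from the actual check's source.)

```python
def classify(value: float, warn: float = 50, crit: float = 100) -> str:
    """Map a per-minute count to an alert state.

    Assumption: "gt" in the alert text means strictly greater than,
    and the thresholds are (W)50 and (C)100 as printed in the message.
    """
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


# The two values reported in the log above.
print(classify(102))
print(classify(14))
```

(Under these assumptions, a value of exactly 100 would still be WARNING rather than CRITICAL, which matches the strict "gt" wording.)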
[13:11:16] <wikibugs>	 (03PS1) 10Kormat: dbutil: Allow .csv path to be overridden in env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633736
[13:15:01] <Urbanecm>	 !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=trwiki --add-prefix=BROKEN --fix # T265336
[13:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:08] <stashbot>	 T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336
[13:16:50] <wikibugs>	 (03PS4) 10Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613)
[13:17:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) (owner: 10Filippo Giunchedi)
[13:19:07] <wikibugs>	 (03PS5) 10Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - 10https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613)
[13:22:59] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] dbutil: Allow .csv path to be overridden in env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633736 (owner: 10Kormat)
[13:24:26] <wikibugs>	 (03Merged) 10jenkins-bot: dbutil: Allow .csv path to be overridden in env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633736 (owner: 10Kormat)
[13:29:53] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:01] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10MSantos) @CDanis and @Dzahn as per T261424#6538173, is there anything else to be done for the 3rd party block in the tr...
[13:33:09] <librenms-wmf>	 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85%
[13:33:15] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:22] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Nfacctd, add src_net, dst_net [puppet] - 10https://gerrit.wikimedia.org/r/633510 (https://phabricator.wikimedia.org/T254332) (owner: 10Ayounsi)
[13:35:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] service_proxy: add node.js keepalive to push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/633199 (owner: 10JMeybohm)
[13:37:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10ayounsi) 05Resolved→03Open This has been alerting again this time for scs-c1-codfw. See https://librenms.wikimedia.org/graphs/device=170/type=device_processor/from=1601991300/legend=yes/pop...
[13:39:10] <librenms-wmf>	 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% got acknowledged
[13:40:25] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10fgiunchedi) Plus these usb recurring messages in dmesg  ` [105275.802560] usb 3-3: USB disconnect, device number 13 [105276.254453] usb 3-3: new high-speed USB device number 14 using xhci_hcd [105276.394569] us...
[13:40:34] <logmsgbot>	 !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[13:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:55] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) Merged, note that it's not in a CIDR notation, so `src_mask` + `dst_mask` would be needed to generate the CIDR form.
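(Editor's note: the comment above says that `src_net`/`dst_net` are not in CIDR notation and that `src_mask`/`dst_mask` are needed to build the CIDR form. A minimal sketch of that combination follows, using Python's standard `ipaddress` module. The helper name `to_cidr` and the sample addresses, which come from the documentation ranges, are hypothetical; the real pipeline's field types and processing step are not shown in this log.)

```python
import ipaddress


def to_cidr(net: str, mask: int) -> str:
    """Combine a network address and a prefix length into CIDR form.

    strict=False is an assumption: it tolerates an address that still has
    host bits set and normalises it to the enclosing network.
    """
    return str(ipaddress.ip_network(f"{net}/{mask}", strict=False))


# Hypothetical sample values standing in for src_net + src_mask.
print(to_cidr("192.0.2.0", 24))
print(to_cidr("2001:db8::", 32))
```

(The same helper would apply to the `dst_net` + `dst_mask` pair.)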
[13:42:55] <logmsgbot>	 !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[13:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:39] <icinga-wm>	 RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops
[13:46:37] <wikibugs>	 (03PS1) 10Ayounsi: Nfacct: add src_mask + dst_mask [puppet] - 10https://gerrit.wikimedia.org/r/633737 (https://phabricator.wikimedia.org/T254332)
[13:55:58] <wikibugs>	 (03PS1) 10Vgutierrez: vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/633739 (https://phabricator.wikimedia.org/T258405)
[13:57:27] <wikibugs>	 (03PS1) 10Andrew Bogott: clouddvirt102[1-9]: apply libvirt-backy-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/633741 (https://phabricator.wikimedia.org/T260692)
[13:59:31] <wikibugs>	 (03PS2) 10Andrew Bogott: clouddvirt102[1-9]: apply libvirt-backy-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/633741 (https://phabricator.wikimedia.org/T260692)
[14:07:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] clouddvirt102[1-9]: apply libvirt-backy-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/633741 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[14:12:52] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "nova-fullstack monitoring: turn on debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/633742 (https://phabricator.wikimedia.org/T265140)
[14:13:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova-fullstack monitoring: turn on debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/633742 (https://phabricator.wikimedia.org/T265140) (owner: 10Andrew Bogott)
[14:16:10] <wikibugs>	 (03PS1) 10Urbanecm: Add setmentor to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633743
[14:16:23] <Urbanecm>	 jouncebot: now
[14:16:23] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 43 minute(s)
[14:16:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add setmentor to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633743 (owner: 10Urbanecm)
[14:17:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add setmentor to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633743 (owner: 10Urbanecm)
[14:18:34] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 5b28fd685b9cb8d8e93650b5d02bc41b81d0883c: Add setmentor to wgAvailableRights (duration: 00m 59s)
[14:18:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:14] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt1022: move to virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/633744 (https://phabricator.wikimedia.org/T260692)
[14:23:17] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138)
[14:25:17] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={3,4,5} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=l
[14:25:17] <icinga-wm>	 topic=All&var-consumer_group=All
[14:26:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[14:27:03] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[14:37:15] <icinga-wm>	 RECOVERY - Disk space on sretest1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops
[14:39:30] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[14:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:36] <wikibugs>	 (03CR) 10CDanis: VCL: A heavy hammer for dire circumstances. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis)
[14:40:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:40:58] <wikibugs>	 (03PS4) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848
[14:41:25] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:41:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:42:00] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1022: move to virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/633744 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[14:43:05] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={1,3,4,5} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster
[14:43:05] <icinga-wm>	 r-topic=All&var-consumer_group=All
[14:46:20] <wikibugs>	 (03CR) 10Ema: VCL: A heavy hammer for dire circumstances. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis)
[14:46:21] <wikibugs>	 (03CR) 10Ema: [C: 03+1] VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis)
[14:51:33] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[14:55:27] <wikibugs>	 (03PS1) 10Kormat: mariadb: Remove unused hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/633768 (https://phabricator.wikimedia.org/T256972)
[14:58:59] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511
[14:59:24] <wikibugs>	 (03CR) 10Kormat: "PCC is no-op: https://puppet-compiler.wmflabs.org/compiler1003/25827/" [puppet] - 10https://gerrit.wikimedia.org/r/633768 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat)
[15:02:23] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) new disk in place     Status  Name  State  Slot Number  Size  Security Status  Bus Protocol  Media Type  Hot Spare  Remaining Rated Write Endurance    Physical Disk 0:1:0  On...
[15:02:37] <godog>	 !log bounce logstash on logstash1007, GC death
[15:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:57] <wikibugs>	 (03CR) 10Andrew Bogott: "I'm confident about the search path thing being safe, less confident about the domain bit" [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott)
[15:04:32] <wikibugs>	 (03PS2) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138)
[15:04:48] <wikibugs>	 (03PS1) 10Kormat: [WIP] mariadb: Convert role::mariadb::core to profile. [puppet] - 10https://gerrit.wikimedia.org/r/633769 (https://phabricator.wikimedia.org/T256972)
[15:05:15] <wikibugs>	 (03CR) 10Volans: "Minor nits inline, LGTM" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[15:05:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[15:05:51] <papaul>	 marostegui: hello can you also take the possessed server (db2125) down so i can replace the CPU
[15:06:45] <kormat>	 papaul: hi, i'll do it.
[15:07:02] <marostegui>	 thanks so much kormat
[15:07:06] <marostegui>	 <3
[15:07:07] <papaul>	 kormat: thanks
[15:08:09] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) ` pt1979@es2026:~$ sudo megacli -PDRbld -ShowProg -physdrv[32:2] -aALL  Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 3% in 6 Minutes.
[15:08:23] <icinga-wm>	 PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/cd7dd8b4f8190a4d3d7e08b4304dcd82cdfd76206313416e6ad144eb359e7ea9/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops
[15:08:29] <kormat>	 papaul: poweroff is running now
[15:08:36] <papaul>	 kormat: cool thanks
[15:10:56] <wikibugs>	 (03PS5) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848
[15:11:32] <wikibugs>	 (03CR) 10Volans: "LGTM, one suggestion inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey)
[15:14:29] <icinga-wm>	 PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:05] <kormat>	 ^ acking
[15:15:27] <icinga-wm>	 ACKNOWLEDGEMENT - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Kormat maintenance
[15:16:36] <wikibugs>	 (03CR) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[15:17:01] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott)
[15:20:57] <wikibugs>	 (03PS6) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848
[15:24:58] <wikibugs>	 (03CR) 10CDanis: "With the new flag disabled, text and upload:" [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis)
[15:25:21] <wikibugs>	 (03PS7) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848
[15:26:12] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt1021: add backy support [puppet] - 10https://gerrit.wikimedia.org/r/633771 (https://phabricator.wikimedia.org/T260692)
[15:26:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1021: add backy support [puppet] - 10https://gerrit.wikimedia.org/r/633771 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[15:27:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott)
[15:29:09] <icinga-wm>	 PROBLEM - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:29:10] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T265368 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:29:17] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T265368 (10ops-monitoring-bot)
[15:30:00] <Spookreeeno>	 Same as T263837?
[15:30:00] <stashbot>	 T263837: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837
[15:30:37] <wikibugs>	 (03CR) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott)
[15:30:41] <wikibugs>	 (03PS3) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511
[15:33:51] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T265368 (10Marostegui) 05Open→03Declined Handled at https://phabricator.wikimedia.org/T263837
[15:36:06] <wikibugs>	 (03PS2) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550
[15:36:08] <wikibugs>	 (03PS3) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138)
[15:36:35] <icinga-wm>	 RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms
[15:37:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey)
[15:37:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[15:38:58] <wikibugs>	 (03PS3) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550
[15:39:00] <wikibugs>	 (03PS4) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138)
[15:39:13] <elukey>	 the -1s are due to another cr waiting to be merged, sorry for the spam
[15:40:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey)
[15:40:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[15:40:49] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) Both CPUs replaced, server is back up
[15:43:18] <wikibugs>	 (03PS4) 10Volans: tests: import dns from new wmflib package [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905)
[15:43:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott)
[15:44:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tests: import dns from new wmflib package [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[15:46:35] <wikibugs>	 (03PS1) 10Andrew Bogott: WMCS nfs: remove the last mounts for wikidata-dev [puppet] - 10https://gerrit.wikimedia.org/r/633773 (https://phabricator.wikimedia.org/T208416)
[15:47:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] WMCS nfs: remove the last mounts for wikidata-dev [puppet] - 10https://gerrit.wikimedia.org/r/633773 (https://phabricator.wikimedia.org/T208416) (owner: 10Andrew Bogott)
[15:50:08] <wikibugs>	 (03PS5) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905)
[15:50:10] <wikibugs>	 (03PS1) 10Volans: Temporary limit spicerack dependency [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905)
[15:52:00] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "seems legitimate" [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[15:52:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[15:52:33] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Temporary limit spicerack dependency [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[15:53:09] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh network configuration [puppet] - 10https://gerrit.wikimedia.org/r/633775 (https://phabricator.wikimedia.org/T261724)
[15:53:38] <wikibugs>	 (03Merged) 10jenkins-bot: Temporary limit spicerack dependency [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans)
[15:53:59] <wikibugs>	 (03PS6) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905)
[15:55:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh network configuration [puppet] - 10https://gerrit.wikimedia.org/r/633775 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez)
[15:55:43] <wikibugs>	 10Operations, 10Phatality, 10observability, 10Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10mmodell) @fgiunchedi it's unlikely that it will work in kibana 7 without significant changes. Kibana's...
[15:56:11] <papaul>	 !log power down ms-be2036 for maintenance 
[15:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:52] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:13] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:00:04] <jouncebot>	 jbond42 and cdanis: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1600).
[16:07:03] <wikibugs>	 (03PS2) 10Ssingh: wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132)
[16:07:37] <marxarelli>	 tgr_: just setting up over here to perform the wmf.11 promotions
[16:07:56] <wikibugs>	 (03CR) 10Ssingh: "Patch rebased, no other changes." [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:08:00] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:09:16] <tgr_>	 marxarelli: ack
[16:13:20] <marxarelli>	 longma: ^
[16:18:00] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Eventstreams: expose revision-visibility-change events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) (owner: 10Ppchelko)
[16:18:52] <wikibugs>	 (03PS4) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550
[16:18:54] <wikibugs>	 (03PS5) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138)
[16:23:33] <wikibugs>	 (03PS1) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779
[16:24:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond)
[16:24:26] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Thank you Papaul, I will start repooling the host tomorrow and see how it goes with load
[16:25:57] <marxarelli>	 tgr_, longma: sorry for the delay. thought i'd email the lists with the latest deployment plan (cc thcipriani). continuing...
[16:26:27] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@77febb6]: airflow: parameterize active mediawiki dc
[16:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:17] <marxarelli>	 tgr_: were your wmf.11 sync'd?
[16:27:24] <marxarelli>	 wmf.11 *backports*
[16:31:57] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@77febb6]: airflow: parameterize active mediawiki dc (duration: 05m 29s)
[16:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:32] <tgr_>	 marxarelli: yeah
[16:34:14] <marxarelli>	 k. just double checking. looks good from here. rolling
[16:35:03] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: update permissions for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/633781
[16:35:17] <marxarelli>	 oh fun. looks like deploy-promote is confused
[16:36:27] <wikibugs>	 (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633784
[16:36:29] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633784 (owner: 10Dduvall)
[16:36:51] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/25836/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633781 (owner: 10Ssingh)
[16:36:57] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dnsdist: update permissions for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/633781 (owner: 10Ssingh)
[16:37:08] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633784 (owner: 10Dduvall)
[16:38:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) @fgiunchedi  ` Embedded Flash/SD-CARD   Controller firmware revision 2.10.00 Embedded media manager failed media attach
[16:39:12] <wikibugs>	 (03Abandoned) 10Volans: sre.hosts.downtime: convert to class-based API [cookbooks] - 10https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[16:39:31] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.11
[16:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:47] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) Upgrade ILO from 2.50 to 2.74
[16:41:11] <wikibugs>	 (03PS1) 10Gergő Tisza: GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372)
[16:44:23] <wikibugs>	 (03PS2) 10Gergő Tisza: GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372)
[16:50:39] <wikibugs>	 (03PS2) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779
[16:51:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond)
[16:52:01] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10nettrom_WMF) >>! In T252391#6536174, @kostajh wrote: > Hmm, I spoke too soon. We rely on the `wgWMEUnderstandingFirstDay` bei...
[16:52:59] <wikibugs>	 (03PS3) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779
[16:53:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond)
[16:55:17] <marxarelli>	 tgr_: so far so good. i'm thinking of rolling wmf.11 to group1 in about 30 min. does that work for you?
[16:55:22] <marxarelli>	 longma: ^
[16:55:47] <wikibugs>	 (03PS4) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779
[16:56:17] <marxarelli>	 i mean, so far so good in terms of logging. if there's more to do to verify the session issue i can hold off
[16:56:53] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10JMinor) Yes, I think everyone who requested to be added to the allow list has been added. There were a couple questions on the mailing lis...
[16:57:43] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey)
[16:58:38] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey)
[17:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1700).
[17:00:21] <wikibugs>	 (03PS5) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779
[17:00:45] <wikibugs>	 (03CR) 10Jbond: wmflib: new fact listening ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond)
[17:01:44] <tgr_>	 marxarelli: looks fine to me
[17:02:24] <marxarelli>	 tgr_: ack. thanks!
[17:05:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/633768 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat)
[17:07:24] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 129.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[17:09:36] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+2] Eventstreams: expose revision-visibility-change events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) (owner: 10Ppchelko)
[17:11:43] <wikibugs>	 (03Merged) 10jenkins-bot: Eventstreams: expose revision-visibility-change events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) (owner: 10Ppchelko)
[17:11:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10wiki_willy) a:05RKemper→03Cmjohnson
[17:15:51] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) return tracking information  {F32383537}
[17:15:52] <logmsgbot>	 !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[17:15:52] <logmsgbot>	 !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[17:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:56] <logmsgbot>	 !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[17:16:56] <logmsgbot>	 !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[17:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:08] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[17:17:32] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Tgr) >>! In T252391#6536174, @kostajh wrote: > I think instead of checking to see if `wgWMEUnderstandingFirstDay` is true, we...
[17:18:28] <logmsgbot>	 !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[17:18:28] <logmsgbot>	 !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[17:18:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:48] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] geoip: archive MaxMind database to hdfs only [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi)
[17:30:37] <marxarelli>	 !log 1.36.0-wmf.11 promoted to group0. no new errors (T263177). preparing to promote to group1
[17:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:43] <stashbot>	 T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177
[17:31:23] <wikibugs>	 (03PS1) 10Dduvall: group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633793
[17:31:25] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633793 (owner: 10Dduvall)
[17:31:48] <marxarelli>	 tgr_, longma: ^ rolling to group1
[17:32:06] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633793 (owner: 10Dduvall)
[17:32:15] <longma>	 👀
[17:32:29] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime
[17:32:30] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:26] <DannyS712>	 marxarelli should I turn on xwikimediadebug and make some test edits in case the issue comes back?
[17:34:12] <marxarelli>	 DannyS712: that'd be great, yeah
[17:34:25] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.11
[17:34:27] <DannyS712>	 with verbose logging, or no?
[17:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:42] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[17:35:24] <marxarelli>	 ^ tgr_ re: debug logging. would that be helpful?
[17:35:34] <logmsgbot>	 !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.11 (duration: 01m 07s)
[17:35:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:39] <marxarelli>	 DannyS712: i think whatever you can do to try and repro the issue would be helpful. tgr_ has set up some additional logging on the session channel but i don't see how additional debug logging would hurt
[17:40:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi)
[17:40:24] <DannyS712>	 I turned on verbose and am just doing things on meta (recent changes patrolling) which should hopefully trigger whatever code paths caused it last time
[17:41:50] <icinga-wm>	 PROBLEM - Host ms-be2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:42:47] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10CDanis) >>! In T261424#6539494, @JMinor wrote: > Yes, I think everyone who requested to be added to the allow list has been added. There w...
[17:42:58] <wikibugs>	 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10CDanis) a:03CDanis
[17:47:01] <wikibugs>	 (03CR) 10Bstorm: [C: 04-1] "The neovim add is fine. We need to add a bit more to make the others work." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer)
[17:47:36] <icinga-wm>	 RECOVERY - Host ms-be2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms
[17:47:41] <marxarelli>	 !log 1.36.0-wmf.13 branched at a6be801fc6331a6a6b96f02f368750200d50ab09 for T263179
[17:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:48] <stashbot>	 T263179: 1.36.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T263179
[17:48:44] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms
[17:49:46] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on mw2279 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2279&var-datasource=codfw+prometheus/ops
[17:51:44] <wikibugs>	 (03CR) 10Bstorm: [C: 04-1] "I see the problem. It's available in buster, but it isn't in stretch. Bastions and gridengine are still on Stretch for now." [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer)
[17:52:51] <wikibugs>	 (03PS1) 10Andrew Bogott: site.pp: rearrange slightly to clarify different cloudvirt roles [puppet] - 10https://gerrit.wikimedia.org/r/633798
[17:54:10] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633598 (https://phabricator.wikimedia.org/T263179) (owner: 10TrainBranchBot)
[17:55:33] <tgr_>	 marxarelli: TBH the chances of it happening to the same user twice in a row are slim. If it was specifically related to X-Wikimedia-Debug, maybe, but DannyS712 said he didn't use XWD before the incident.
[17:56:14] <DannyS712>	 I didn't have it turned on, but I had used it before
[17:56:14] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) The Flash/SD-CARD  problem was fixed by formatting the NAND and draining the power    ` Embedded Flash/SD-CARD   Controller firmware revision 2.10.00
[17:56:18] <DannyS712>	 but yeah, it's unlikely
[17:56:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] site.pp: rearrange slightly to clarify different cloudvirt roles [puppet] - 10https://gerrit.wikimedia.org/r/633798 (owner: 10Andrew Bogott)
[17:56:37] <marxarelli>	 ok
[17:56:48] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) @fgiunchedi looks like icinga is happy now   `    MD RAID   View Extra Service Notes  OK  2020-10-13 17:51:33  0d 0h 3m 51s  1/3  OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:57:37] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Papaul) @jijiki  I will request a disk replacement
[17:58:34] <marxarelli>	 tgr_: i'm prepping wmf.13 to go out during the normal window. i would like to get wmf.11 to all wikis before then, but you mentioned having a meeting to attend in 30 min. if i rolled wmf.11 to all wikis now, would that work for you or would you rather i wait until after your meeting?
[17:59:02] <marxarelli>	 i haven't seen anything of concern after rolling to group1
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1800)
[18:00:13] <wikibugs>	 (03CR) 10Muehlenhoff: Add neovim, fd and ripgrep to toolforge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer)
[18:00:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) >>! In T238036#6538876, @ayounsi wrote: > This has been alerting again this time for scs-c1-codfw. See https://librenms.wikimedia.org/graphs/device=170/type=device_processor/from=16019913...
[18:01:47] <robh>	 !log scs-c1-codfw firmware update via T238036
[18:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:54] <stashbot>	 T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036
[18:02:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:03:10] <librenms-wmf>	 08̶W̶a̶r̶n̶i̶n̶g Device scs-c1-codfw.mgmt.codfw.wmnet recovered from Processor usage over 85%
[18:03:54] <robh>	 doesn't count, it's only recovered cuz it's rebooting ;D
[18:04:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:06:27] <wikibugs>	 (03PS5) 10Dzahn: wikistats: rm php7.0 pre-buster support, make PHP version parameter [puppet] - 10https://gerrit.wikimedia.org/r/633286
[18:08:32] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[18:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:42] <robh>	 !log scs-c1-codfw mgmt firmware updated, updating scs-a1-codfw T238036
[18:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:47] <stashbot>	 T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036
[18:10:31] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:10:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25838/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633286 (owner: 10Dzahn)
[18:12:23] <wikibugs>	 (03CR) 10Bstorm: [C: 04-1] Add neovim, fd and ripgrep to toolforge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer)
[18:12:28] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin: Update some of my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/633802
[18:15:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Update some of my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/633802 (owner: 10Alexandros Kosiaris)
[18:19:12] <tgr_>	 marxarelli: sorry, missed the ping. It would work; I don't think I can do much other than sit and wait to see if someone reports an issue, anyway.
[18:19:36] <marxarelli>	 tgr_: np. i'll roll it then
[18:19:38] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633598 (https://phabricator.wikimedia.org/T263179) (owner: 10TrainBranchBot)
[18:20:31] <wikibugs>	 (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633804
[18:20:33] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633804 (owner: 10Dduvall)
[18:20:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:07] <marxarelli>	 !log 1.36.0-wmf.11 promoted to group1. no new errors (T263177). promoting to all wikis
[18:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:13] <stashbot>	 T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177
[18:21:15] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633804 (owner: 10Dduvall)
[18:21:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:23:10] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.11
[18:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:40] <icinga-wm>	 PROBLEM - Long running screen/tmux on an-launcher1002 is CRITICAL: CRIT: Long running SCREEN process. (user: milimetric PID: 22072, 1741990s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[18:24:56] <milimetric>	 ah, my apologies
[18:25:30] <milimetric>	 ok, closed, sorry again
[18:25:34] <milimetric>	 what a nifty tool :)
[18:26:36] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[18:37:36] <wikibugs>	 (03CR) 10Dzahn: calico: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633033 (owner: 10Dzahn)
[18:39:10] <icinga-wm>	 RECOVERY - MegaRAID on es2026 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:40:42] <wikibugs>	 (03PS2) 10Dzahn: calico: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633033
[18:41:34] <mutante>	 milimetric: hehe, thanks. (https://xkcd.com/838/)
[18:42:28] <mutante>	 it's possible to whitelist hosts/roles fwiw
[18:44:26] <milimetric>	 lol, yeah, no I belong on the naughty list this time.  If I ever submit something that long running, it better be running on a distributed system and not a screen :)
[18:45:21] <mutante>	 ok;)
[18:52:03] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "role::calico::builder not currently on any instances. would be prefix "calico" in packaging project in cloud: https://openstack-browser.to" [puppet] - 10https://gerrit.wikimedia.org/r/633033 (owner: 10Dzahn)
[18:52:50] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[18:53:24] <logmsgbot>	 !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.6 (duration: 13m 00s)
[18:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:18] <logmsgbot>	 !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.8 (duration: 02m 10s)
[18:56:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "relforge: https://puppet-compiler.wmflabs.org/compiler1003/25841/" [puppet] - 10https://gerrit.wikimedia.org/r/633022 (owner: 10Dzahn)
[18:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:15] <logmsgbot>	 !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.9 (duration: 01m 56s)
[18:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:53] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/25843/lvs2009.codfw.wmnet/change.lvs2009.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/633026 (owner: 10Dzahn)
[18:58:55] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] api-gateway: more instances in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/633559 (owner: 10Hnowlan)
[18:59:01] <wikibugs>	 (03PS1) 10Dduvall: testwikis wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633810
[18:59:03] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633810 (owner: 10Dduvall)
[18:59:41] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633810 (owner: 10Dduvall)
[19:00:03] <logmsgbot>	 !log dduvall@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.13
[19:00:04] <jouncebot>	 marxarelli and longma: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1900).
[19:00:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:37] <wikibugs>	 (03PS2) 10Dzahn: pybal: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633026
[19:02:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) I've successfully upgraded the scs firmware fleetwide, with the exception of two devices:  * [[ https://netbox.wikimedia.org/dcim/devices/1327/ | scs-a8-eqiad ]] - old model CM4148, needs...
[19:02:34] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25844/" [puppet] - 10https://gerrit.wikimedia.org/r/633026 (owner: 10Dzahn)
[19:11:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy)
[19:11:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) Email sent to Julianne to re-check the invoice data pulled for asw2-d4-eqiad  Thanks, Willy
[19:12:34] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/25845/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[19:23:31] <wikibugs>	 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! In T252391#6539461, @nettrom_WMF wrote: >>>! In T252391#6536174, @kostajh wrote: >> Hmm, I spoke too soon. We re...
[19:23:45] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10jijiki) Thank you!
[19:24:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:24:16] <wikibugs>	 (03PS3) 10Kosta Harlan: Disable wgWMEUnderstandingFirstDay (EditorJourney) logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391)
[19:24:22] <wikibugs>	 (03PS2) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953)
[19:24:52] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan)
[19:26:09] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza)
[19:26:53] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis)
[19:27:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:31:59] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] Disable wgWMEUnderstandingFirstDay (EditorJourney) logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan)
[19:32:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:33:20] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 04-1] "Roan pointed out that this wouldn't actually undo the patch since it removes A from the valid variant list, which is not configurable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza)
[19:33:40] <icinga-wm>	 PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:34:23] <wikibugs>	 (03PS1) 10Andrew Bogott: Update profile::openstack::base::nova::instance_dev for several cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/633816
[19:34:25] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove hiera for labvirt1010/1011 [puppet] - 10https://gerrit.wikimedia.org/r/633817
[19:35:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove hiera for labvirt1010/1011 [puppet] - 10https://gerrit.wikimedia.org/r/633817 (owner: 10Andrew Bogott)
[19:35:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Update profile::openstack::base::nova::instance_dev for several cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/633816 (owner: 10Andrew Bogott)
[19:35:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:38:49] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza)
[19:39:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) From the parent task:  >>! In T216195#6524284, @ayounsi wrote: > Note that now racks `C8` and `D5` are dedicated to WMCS s...
[19:40:37] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 04-1] "> Patch Set 2: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza)
[19:40:40] <logmsgbot>	 !log dduvall@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.13 (duration: 40m 51s)
[19:40:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:57] <hashar>	 marxarelli: lovely :)
[19:43:25] <wikibugs>	 (03Abandoned) 10Gergő Tisza: GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza)
[19:44:55] <marxarelli>	 hashar: so far so good :)
[19:45:09] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10dcausse) happened again today:  ` [Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [Tue Oct 13 1...
[19:45:28] <icinga-wm>	 RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:49:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[19:49:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:17] <marxarelli>	 testwiki and logs look ok to me. rolling wmf.13 to group0, cc: longma 
[19:52:04] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:52:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:15] <wikibugs>	 (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633822
[19:52:17] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633822 (owner: 10Dduvall)
[19:52:54] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633822 (owner: 10Dduvall)
[19:54:20] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.13
[19:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:37] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "works on most prod hosts but there seems to be some special case with cloud/labweb" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:01:28] <wikibugs>	 (03PS3) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953)
[20:02:24] <wikibugs>	 (03CR) 10Dzahn: "it's because hieradata/cloud/eqiad1/deployment-prep/common.yaml does not have a real FQDN for memcached_servers but it should be. fixing i" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:03:19] <ebernhardson>	 !log add elastic2029-production-search-psi-codfw to cluster.routing.allocation.exclude._name to drain active shards, instance currently in gc hell
[20:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:36] <marxarelli>	 !log 1.36.0-wmf.13 promoted to group0. no new or concerning errors or changes in error rates (T263179)
[20:06:40] <wikibugs>	 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) @ema are you the sole author, or would you like additional authors added?
[20:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:42] <stashbot>	 T263179: 1.36.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T263179
[20:07:03] <longma>	 logs look alright to me as well
[20:08:16] * marxarelli nods
[20:08:20] <marxarelli>	 until tomorrow
[20:11:28] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[20:12:01] <wikibugs>	 (03PS4) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953)
[20:14:05] <ebernhardson>	 !log restart production-search-psi-codfw on elastic2029 to reset any wonkiness from gc hell
[20:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:58] <ebernhardson>	 !log unban elastic2029 from production-search-psi-codfw
[20:16:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:51] <wikibugs>	 (03PS1) 10Catrope: Revert "Make variant D the default, and remove variant A" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633757 (https://phabricator.wikimedia.org/T265372)
[20:27:12] <wikibugs>	 (03PS1) 10Dzahn: netmon: remove stretch PHP 7.2 support [puppet] - 10https://gerrit.wikimedia.org/r/633824
[20:27:14] <wikibugs>	 (03PS1) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825
[20:27:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn)
[20:30:04] <wikibugs>	 (03PS2) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825
[20:31:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn)
[20:32:57] <wikibugs>	 (03PS3) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825
[20:34:04] <wikibugs>	 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) @ema For the main image, I went with the earth from space: https://commons.wikimedia.org/wiki/File:North_America_from_low_orbiting_...
[20:39:19] <wikibugs>	 (03PS1) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827
[20:39:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn)
[20:42:17] <wikibugs>	 (03PS2) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827
[20:43:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn)
[20:44:26] <mutante>	 !log bast1002 - apt-get remove nmap (it can be used on netmon hosts and was not consistent with other bast hosts)
[20:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:58] <mutante>	 !log bast1002 - apt-get autoremove - cleans up golang and ruby packages
[20:45:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:42] <wikibugs>	 (03PS3) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827
[20:50:24] <wikibugs>	 (03CR) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:53:12] <wikibugs>	 (03PS5) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953)
[20:58:05] <wikibugs>	 (03PS6) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953)
[20:59:25] <wikibugs>	 (03CR) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:02:26] <wikibugs>	 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) p:05Triage→03Medium
[21:02:54] <wikibugs>	 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH)
[21:03:13] <wikibugs>	 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH)
[21:03:16] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs VM backups: add two more backup hosts, increase days to 7 [puppet] - 10https://gerrit.wikimedia.org/r/633829 (https://phabricator.wikimedia.org/T260692)
[21:03:52] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "" Cannot reassign variable '$java_home'" ?  https://puppet-compiler.wmflabs.org/compiler1001/25850/gerrit1001.wikimedia.org/change.gerrit1" [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff)
[21:05:52] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs VM backups: add two more backup hosts, increase days to 7 [puppet] - 10https://gerrit.wikimedia.org/r/633829 (https://phabricator.wikimedia.org/T260692)
[21:06:11] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "noop on everything including cloud for C:profile::mediawiki::common  https://puppet-compiler.wmflabs.org/compiler1003/25849/" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:07:44] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[21:07:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:16] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[21:09:06] <mutante>	 interesting. is that being used as a test host right now?
[21:09:39] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:07] <icinga-wm>	 ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions daniel_zahn https://phabricator.wikimedia.org/T264698 https://wikitech.wikimedia.org/wiki/Application_servers
[21:12:40] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[21:12:40] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:15] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Dzahn) ` 21:08 <+icinga-wm> PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions                     https://wikitech.wikimedia....
[21:16:11] <mutante>	 !log icinga had gerrit health alert but did not notice an issue myself and was gone next check
[21:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:09] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "result for profile::mediawiki::webserver: https://puppet-compiler.wmflabs.org/compiler1001/25851/" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:22:28] <wikibugs>	 (03CR) 10Dzahn: "this is odd. Sometimes "optional parameter listed before required parameter" makes jerkins -1 but in other places I totally expected it to" [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn)
[21:28:42] <wikibugs>	 (03PS3) 10Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288
[21:29:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn)
[21:30:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs VM backups: add two more backup hosts, increase days to 7 [puppet] - 10https://gerrit.wikimedia.org/r/633829 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[21:32:48] <wikibugs>	 (03PS2) 10Dzahn: elasticsearch: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/633022
[21:34:49] <wikibugs>	 (03PS4) 10Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288
[21:38:44] <wikibugs>	 (03PS3) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152)
[21:41:26] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Papaul)  Create Dispatch: Success You have successfully submitted request SR1039679642.
[21:50:59] <wikibugs>	 (03PS1) 10DannyS712: Partially revert "[labs] Remove wmgMonologChannels override" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633761
[21:51:06] <wikibugs>	 (03PS4) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152)
[21:51:22] <wikibugs>	 (03PS1) 10Dzahn: docker::registry: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633835
[21:51:56] <wikibugs>	 (03PS2) 10DannyS712: Partially revert "[labs] Remove wmgMonologChannels override" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633761
[21:55:24] <wikibugs>	 (03PS1) 10Dzahn: docker: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633836
[22:04:24] <wikibugs>	 (03PS1) 10Dzahn: ldap::client::labs: fix 'Unknown variable: '::restricted..' [puppet] - 10https://gerrit.wikimedia.org/r/633838
[22:05:35] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10RobH)
[22:05:38] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10RobH)
[22:07:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:08:01] <wikibugs>	 (CR) Dzahn: [C: -1] "The title 'et' has already been used in this resource expression" [puppet] - https://gerrit.wikimedia.org/r/633288 (owner: Dzahn)
[22:08:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:10:01] <wikibugs>	 (PS5) Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - https://gerrit.wikimedia.org/r/633288
[22:10:08] <wikibugs>	 (CR) Dzahn: "nice.. so all this time a duplicate cron job that now shows up to this change :)" [puppet] - https://gerrit.wikimedia.org/r/633288 (owner: Dzahn)
[22:23:08] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25859/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - https://gerrit.wikimedia.org/r/633288 (owner: Dzahn)
[22:25:22] <icinga-wm>	 RECOVERY - Long running screen/tmux on an-launcher1002 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[22:38:15] <wikibugs>	 (PS1) Dzahn: wikistats (cloud): disable crons on one of 2 instances [puppet] - https://gerrit.wikimedia.org/r/633842
[22:39:46] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/25860/" [puppet] - https://gerrit.wikimedia.org/r/633842 (owner: Dzahn)
[22:41:12] <wikibugs>	 Operations, ops-codfw, netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (RobH)
[22:55:31] <wikibugs>	 (PS1) Catrope: Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214)
[22:58:44] <wikibugs>	 (PS1) Dzahn: wikistats: allow to 'absent' import/dump crons as well (WIP) [puppet] - https://gerrit.wikimedia.org/r/633845
[22:59:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikistats: allow to 'absent' import/dump crons as well (WIP) [puppet] - https://gerrit.wikimedia.org/r/633845 (owner: Dzahn)
[22:59:50] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the wmf group for Neha Nair (nnair) - https://phabricator.wikimedia.org/T265428 (drochford)
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T2300).
[23:00:04] <jouncebot>	 hmonroy, tgr, and RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:14] <RoanKattouw>	 I'll deploy
[23:00:31] <hmonroy>	 Let's do it!
[23:00:45] <tgr_>	 o/
[23:00:47] <wikibugs>	 (CR) Catrope: [C: +2] Revert "Make variant D the default, and remove variant A" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - https://gerrit.wikimedia.org/r/633757 (https://phabricator.wikimedia.org/T265372) (owner: Catrope)
[23:00:57] <wikibugs>	 (PS3) Catrope: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: HMonroy)
[23:01:05] <wikibugs>	 (PS4) Catrope: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: HMonroy)
[23:01:24] <wikibugs>	 (CR) Catrope: [C: +2] Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: HMonroy)
[23:02:13] <wikibugs>	 (Merged) jenkins-bot: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: HMonroy)
[23:02:35] <RoanKattouw>	 hmonroy: Your change is on mwdebug2001, please test
[23:02:45] <hmonroy>	 checking
[23:05:52] <hmonroy>	 RoanKattouw: Looks good!
[23:06:40] <wikibugs>	 (PS2) Catrope: Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[23:06:50] <wikibugs>	 (CR) Catrope: [C: +2] Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[23:07:19] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable watchlist expiry on frwiki, fawiki, dewiki, cswiki (T264780) (duration: 01m 04s)
[23:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:26] <stashbot>	 T264780: Watchlist Expiry: Release to group 2 pilot wikis [TUES, OCT 13] - https://phabricator.wikimedia.org/T264780
[23:07:29] <RoanKattouw>	 hmonroy: And it's live!
[23:07:48] <hmonroy>	 RoanKattouw: Awesome! Thank you :)
[23:07:54] <wikibugs>	 (Merged) jenkins-bot: Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[23:09:10] <RoanKattouw>	 tgr_: Your patch is on mwdebug2001, would you like to test it or should I just deploy it right away?
[23:12:25] <wikibugs>	 (Merged) jenkins-bot: Revert "Make variant D the default, and remove variant A" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - https://gerrit.wikimedia.org/r/633757 (https://phabricator.wikimedia.org/T265372) (owner: Catrope)
[23:12:37] <tgr_>	 RoanKattouw: tested, thanks!
[23:12:50] <wikibugs>	 (PS2) Catrope: Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214)
[23:12:54] <wikibugs>	 (CR) Catrope: [C: +2] Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214) (owner: Catrope)
[23:13:43] <wikibugs>	 (Merged) jenkins-bot: Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214) (owner: Catrope)
[23:14:19] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable event logging in MediaViewer (T260582) (duration: 01m 04s)
[23:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:25] <stashbot>	 T260582: Migrate EventLogging MediaViewer data to Event Platform - https://phabricator.wikimedia.org/T260582
[23:18:39] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Rename GrowthExperiments help desk on ptwiki (T265214) (duration: 01m 04s)
[23:18:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:47] <stashbot>	 T265214: Change Growth features parameters on Portuguese Wikipedia - https://phabricator.wikimedia.org/T265214
[23:22:40] <logmsgbot>	 !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/GrowthExperiments/: Revert removal of variant A (T265372) (duration: 01m 04s)
[23:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:46] <stashbot>	 T265372: Variant C/D: configuration control - https://phabricator.wikimedia.org/T265372
[23:26:00] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 76 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:37:18] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:40:09] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (Dwisehaupt) Invalid→Resolved
[23:43:17] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (Dwisehaupt) Resolved→Invalid