[00:37:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:38:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:55:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [01:55:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:57:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:07:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633598 [02:11:04] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633598 (https://phabricator.wikimedia.org/T263179) (owner: 10TrainBranchBot) [03:23:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:28:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:36:04] PROBLEM - Disk space on ms-be2036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [04:05:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:07:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:08:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:11:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:13:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:18:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [05:11:13] Urbanecm: Sorry, I was out yesterday, will check the task where you pinged me [05:35:22] !log Set global innodb_change_buffering = inserts; on pc2009 T263443 [05:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:30] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [05:41:53] (03CR) 10Muehlenhoff: "We can simply drop timidity and freepats from the Puppet manifest; installed packages will stick around and things will get cleaned up whe" [puppet] - 10https://gerrit.wikimedia.org/r/445604 (owner: 10Reedy) [05:50:30] (03PS1) 10Marostegui: site.pp: Remove s3 comment [puppet] - 10https://gerrit.wikimedia.org/r/633603 [05:51:28] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove s3 comment [puppet] - 10https://gerrit.wikimedia.org/r/633603 (owner: 10Marostegui) [05:55:46] PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - 
degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:12] (03PS1) 10Elukey: Remove analytics1048 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/633605 (https://phabricator.wikimedia.org/T255140) [06:19:43] (03CR) 10Elukey: [C: 03+2] Remove analytics1048 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/633605 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [06:39:54] !log Installing httpcomponents-client security updates for Stretch [06:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:42:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:02:56] !log installing PHP 7.0 security updates [07:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:07] (03PS1) 10Ayounsi: Add cloud-in4 filters to cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/633683 (https://phabricator.wikimedia.org/T265288) [07:06:53] (03CR) 10Ayounsi: [C: 03+2] Add cloud-in4 filters to cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/633683 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi) [07:07:17] (03Merged) 10jenkins-bot: Add cloud-in4 filters to cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/633683 (https://phabricator.wikimedia.org/T265288) (owner: 10Ayounsi) [07:15:01] (03CR) 10Volans: "Nice! Thanks for working on this. 
Replies inline about naming and a couple of nits." (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [07:23:41] (03CR) 10Elukey: [C: 03+1] documentation: refactor configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/633477 (owner: 10Volans) [07:24:41] (03CR) 10Volans: [C: 03+2] documentation: refactor configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/633477 (owner: 10Volans) [07:24:51] (03CR) 10Volans: [C: 03+2] pylint: allow 'logger' as module-scope name [software/spicerack] - 10https://gerrit.wikimedia.org/r/633478 (owner: 10Volans) [07:27:55] (03Merged) 10jenkins-bot: documentation: refactor configuration [software/spicerack] - 10https://gerrit.wikimedia.org/r/633477 (owner: 10Volans) [07:28:33] (03Merged) 10jenkins-bot: pylint: allow 'logger' as module-scope name [software/spicerack] - 10https://gerrit.wikimedia.org/r/633478 (owner: 10Volans) [07:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [07:33:25] 10Operations, 10DBA, 10Data-Persistence, 10Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10hashar) From yesterday discussion: * we do not run arbitrary images * the CI job usually just run an image using Docker, the i... [07:35:41] jayme: thanks for the fix for ms-be2036! [07:36:35] 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10Marostegui) [07:37:24] godog: sure, yw! [07:37:37] !log installing ruby security updates on stretch [07:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:14] godog: I did not look into what has gone wrong, though. root's bash_history suggests there were other directories with broken permissions as well...do you have a theory?
[07:40:36] 10Operations, 10MW-on-K8s, 10serviceops: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) [07:41:54] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 79 probes of 651 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:42:03] jayme: sadly not a good one no, the host was in trouble over the weekend and the hw controller I think freaked out, with one of the ssd in sw raid timing out [07:43:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [07:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:39] !log running schema change against s3 in eqiad T259831 [07:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:44] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [07:49:22] ACKNOWLEDGEMENT - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Marostegui T265323 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:53] 10Operations, 10MW-on-K8s, 10serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) p:05Triage→03High [07:52:24] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 70 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:55:19] mmhh looks like the ms-be2036 fs full is back, I'll take a look [07:55:38] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [07:55:39] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:31] 10Operations, 10Machine Learning Platform, 10SRE-Access-Requests: Requesting adding to ores-admin for Ladsgroup - https://phabricator.wikimedia.org/T265172 (10Lydia_Pintscher) It'd be <3 if we could get this approved so that we can get T261326 done in time for Wikidata's 8th birthday on the 29th. [07:57:10] (03CR) 10Kormat: [C: 03+2] (Mostly) convert to pytest. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 (owner: 10Kormat) [08:00:02] (03Merged) 10jenkins-bot: (Mostly) convert to pytest. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633534 (owner: 10Kormat) [08:04:26] RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [08:11:17] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime [08:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:45] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) The function `VUT_Main` is the main loop of VUT programs. The [[ https://github.com/varnishcache/varnish-cache/blob/6d4df3639725bbec6d1657b07867ec44f4ba14f8/lib/libvarnish... [08:13:15] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:02] (03PS1) 10Kormat: tox: Output format diffs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633691 [08:18:26] (03CR) 10Kormat: [C: 03+2] tox: Output format diffs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633691 (owner: 10Kormat) [08:19:53] (03Merged) 10jenkins-bot: tox: Output format diffs [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633691 (owner: 10Kormat) [08:21:41] 10Operations, 10netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (10ayounsi) Steps to go directly with the most sustainable option. No need for a site depool if done carefully. {F32383060} [] Configure cr2/3:ae0 with `link-speed mixed` [] Disconnect cable #20042 between... 
[08:24:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:25:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:29:02] 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [08:40:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [08:40:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:08] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [08:56:04] (03PS1) 10Elukey: Update tests and docs to Varnish 6 and Debian Buster [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/633695 [08:59:11] (03PS2) 10Elukey: Update tests and docs to Varnish 6 and Debian Buster [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/633695 [09:01:12] (03PS1) 10Elukey: Set VUT grouping parameter to 'request' by default [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/633696 (https://phabricator.wikimedia.org/T264074) [09:02:01] (03CR) 10Elukey: [V: 03+2 C: 03+2] Update tests and docs to Varnish 6 and Debian Buster [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/633695 (owner: 10Elukey) [09:05:53] PROBLEM - Prometheus jobs 
reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:07:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:00] (03CR) 10Ema: [C: 03+1] Set VUT grouping parameter to 'request' by default [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/633696 (https://phabricator.wikimedia.org/T264074) (owner: 10Elukey) [09:20:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:22:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:22:25] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Ladsgroup) [09:23:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.44 [software/spicerack] - 10https://gerrit.wikimedia.org/r/633697 [09:26:56] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Ladsgroup) [09:27:49] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10Elitre) @JMinor @JKatzWMF please give your Go to get this done. Thanks! 
[09:27:54] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.44 [software/spicerack] - 10https://gerrit.wikimedia.org/r/633697 (owner: 10Volans) [09:29:29] (03CR) 10Jbond: [C: 03+1] Add an apt proxy config for deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/633172 (https://phabricator.wikimedia.org/T262647) (owner: 10Muehlenhoff) [09:29:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 38 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:30:37] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 9 probes of 651 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:30:55] (03CR) 10Jbond: [C: 03+2] admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/633365 (owner: 10Urbanecm) [09:30:57] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.44 [software/spicerack] - 10https://gerrit.wikimedia.org/r/633697 (owner: 10Volans) [09:31:16] (03CR) 10Jbond: "merged" [puppet] - 10https://gerrit.wikimedia.org/r/633365 (owner: 10Urbanecm) [09:32:39] !log cp3050: set grouping by request (vut->g_arg = 2) on varnishkafka-webrequest T264074 [09:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:46] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 [09:36:07] (03PS12) 10Gehel: Introduce an interface for progress bars. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) [09:37:48] (03PS1) 10Volans: Upstream release v0.0.44 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/633698 [09:38:30] (03CR) 10jerkins-bot: [V: 04-1] Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [09:38:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:38:58] !log running schema change against s1 in eqiad T259831 [09:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:05] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:39:59] (03CR) 10jerkins-bot: [V: 04-1] Upstream release v0.0.44 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/633698 (owner: 10Volans) [09:40:15] (03PS13) 10Gehel: Introduce an interface for progress bars. [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) [09:42:53] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Ladsgroup) After talking to @ema, I'm cleaning this up a bit. Delet... 
[09:46:08] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [09:46:49] (03CR) 10Volans: "recheck" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/633698 (owner: 10Volans) [09:49:38] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (10JMeybohm) p:05Triage→03Medium [09:50:42] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.44 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/633698 (owner: 10Volans) [09:51:09] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10JMeybohm) p:05Triage→03Medium [09:51:36] !log cp3052: systemctl restart varnishkafka-webrequest.service T264074 [09:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:43] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 [09:51:49] (03PS1) 10Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Ifa63ea16a07d8a39f71676e1300f38b6492afddb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633702 [09:51:51] (03PS1) 10Evrifaessa: Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) [09:52:01] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10JMeybohm) p:05Triage→03Medium [09:52:50] (03Abandoned) 10Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: Ifa63ea16a07d8a39f71676e1300f38b6492afddb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633702 (owner: 10Evrifaessa) [09:53:07] (03Merged) 
10jenkins-bot: Upstream release v0.0.44 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/633698 (owner: 10Volans) [09:53:43] 10Operations, 10MW-on-K8s, 10serviceops: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10JMeybohm) p:05Triage→03Medium [09:55:24] 10Operations, 10CAS-SSO, 10User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10JMeybohm) p:05Triage→03Medium [09:55:30] !log cp3054: systemctl restart varnishkafka-webrequest.service T264074 [09:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:32] 10Operations, 10homer, 10netops: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10ayounsi) [10:02:22] (03CR) 10Gehel: "A few minor comments inline. I've only done a high level review, someone should review in more dtails." (038 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [10:04:33] !log uploaded spicerack_0.0.44 to apt.wikimedia.org buster-wikimedia [10:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:43] 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10JMeybohm) p:05Triage→03Medium [10:07:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:08:10] 10Operations, 10DBA, 10Data-Persistence, 10Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10JMeybohm) p:05Triage→03Medium [10:09:22] 
!log cp3050: *reload* varnishkafka-webrequest T264074 [10:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:27] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 [10:09:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:19:24] !log cp3050: clear varnishkafka-webrequest's vut->sighup via stap T264074 [10:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:29] (03PS1) 10Filippo Giunchedi: install_server: use standard partman recipe for nvme cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) [10:19:29] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 [10:29:45] !log upgrading spicerack on cumin2001 to 0.0.44 [10:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [10:37:55] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [10:40:11] 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) >>! In T263536#6522137, @fgiunchedi wrote: > With 50 percentile added I'm considering this closed! > > A... 
[10:41:03] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10fgiunchedi) I wanted to mention/add that as part of {T263536} there... [10:43:47] (03PS1) 10Volans: doc: add missing link to wmflib package [software/spicerack] - 10https://gerrit.wikimedia.org/r/633709 [10:46:14] (03CR) 10Elukey: [C: 03+1] doc: add missing link to wmflib package [software/spicerack] - 10https://gerrit.wikimedia.org/r/633709 (owner: 10Volans) [10:46:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:47:26] !log no-change rolling restart of push-notifications in codfw - T265258 [10:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:33] T265258: High latency on push notification service initialization - https://phabricator.wikimedia.org/T265258 [10:47:41] (03PS2) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [10:48:50] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [10:50:32] 10Operations, 10Analytics-Clusters, 10Traffic, 10Patch-For-Review: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6537906, @ema wrote: > So varnishkafka seems to be correctly looping continuously in the do-while part of VUT_Main. Why is VSM_Status b... 
[10:51:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:54:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [10:55:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_codfw,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:57:15] (03PS3) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [10:58:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:59:42] 10Operations, 10DBA: Puppetize grants for mysql hosts that are the source of recovery (dbstore, passive misc) - https://phabricator.wikimedia.org/T111929 (10LSobanski) @jcrespo could you weigh in on what is the exact work that needs to happen here? [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1100). [11:00:04] Zoranzoki21 and Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] I can deploy today! [11:00:36] Evrifaessa: hello, are you around? 
[11:01:33] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh routing configuration [puppet] - 10https://gerrit.wikimedia.org/r/633711 (https://phabricator.wikimedia.org/T261724) [11:01:46] Urbanecm: I'm here [11:01:48] o/ [11:01:59] cool [11:02:12] (03PS2) 10Urbanecm: Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) (owner: 10Evrifaessa) [11:02:16] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) (owner: 10Evrifaessa) [11:03:01] (03Merged) 10jenkins-bot: Add namespace aliases for Turkish Wikipedia (trwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633703 (https://phabricator.wikimedia.org/T265336) (owner: 10Evrifaessa) [11:03:19] Evrifaessa: can you test at mwdebug2001, please? [11:03:52] w8 [11:04:10] 10Operations, 10DBA, 10Growth-Team: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154 (10LSobanski) 05Open→03Declined We consider this task to be very risky and with the limited gain suggest against g... [11:04:10] w8? [11:04:14] wait [11:04:28] it works, but we need to move the pages I guess [11:04:42] I'll do that with a script [11:04:46] so, that's fine [11:05:22] godog: sorry, I just devoiced you, as you're not a bot actually :-) [11:05:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh routing configuration [puppet] - 10https://gerrit.wikimedia.org/r/633711 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:05:32] Hello, sorry for being late. :) [11:06:01] Urbanecm: how can you be sure?
:) [11:06:23] (CR) Volans: [C: +2] doc: add missing link to wmflib package [software/spicerack] - https://gerrit.wikimedia.org/r/633709 (owner: Volans) [11:06:25] kormat: I think I talked with the real godog a few times, but maybe it was an impostor? :-) [11:06:42] Evrifaessa: if that's all, I think we can go ahead? [11:06:43] Hi Kizule [11:07:18] Hello Urbanecm :) [11:07:44] I have only https://gerrit.wikimedia.org/r/c/633250/ :) [11:07:56] Kizule: I know, but I started with Evrifaessa, since you were not arouns [11:08:00] you're in the queue :) [11:09:04] Okay, my postman was a little while ago, it is reason why I haven't joined in the time. [11:09:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e61fcebe7315f73d1fb4d531da37d2c1253115ee: Add namespace aliases for Turkish Wikipedia (T265336) (duration: 00m 59s) [11:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:20] T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336 [11:09:53] (Merged) jenkins-bot: doc: add missing link to wmflib package [software/spicerack] - https://gerrit.wikimedia.org/r/633709 (owner: Volans) [11:10:42] Evrifaessa: script started, patch deployed [11:10:50] (PS3) Urbanecm: Add suppressredirect right to reviewers on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633250 (https://phabricator.wikimedia.org/T265169) (owner: Zoranzoki21) [11:10:54] (CR) Urbanecm: [C: +2] Add suppressredirect right to reviewers on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633250 (https://phabricator.wikimedia.org/T265169) (owner: Zoranzoki21) [11:11:00] ty [11:11:06] np [11:11:37] (Merged) jenkins-bot: Add suppressredirect right to reviewers on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/633250 (https://phabricator.wikimedia.org/T265169) (owner: Zoranzoki21) [11:11:59] Kizule: can you try at mwdebug2001, please?
[11:12:06] Urbanecm: how much time is it going to take for the script to be finished? [11:12:36] several minutes - I've actually started just dryrun now, and will start the full run once the dry one completes [11:13:06] I'll log here the start and en [11:13:06] (03CR) 10Gehel: Introduce an interface for progress bars. (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [11:13:07] *end [11:13:24] !log installed spicerack_0.0.43-1+deb10u1_amd64.deb on cumin2001 , need to wait a long-rnning cookbook to end to upgrade both hosts [11:13:24] Urbanecm: Okay [11:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:01] (03CR) 10Gehel: Introduce an interface for progress bars. (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [11:14:15] !log Start of `urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=trwiki --fix # T265336` [11:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:21] T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336 [11:15:24] Kizule: how is it going? [11:15:28] Urbanecm: My patch should be good, on Special:UserGroupRights I see suppressredirect added in "reviewers", as should be. [11:15:32] cool [11:15:58] syncing [11:16:05] Urbanecm: Patch is good to go, yea. 
:) [11:16:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 90028b4c3c1cd4407e0834d603ccb8b256f2498e: Add suppressredirect right to reviewers on bnwiki (T265169) (duration: 00m 58s) [11:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:01] T265169: Add suppressredirect right to reviewers on bnwiki - https://phabricator.wikimedia.org/T265169 [11:17:02] Kizule: should be live :) [11:17:03] Urbanecm: there seems to be conflicts [11:17:12] Urbanecm: Checking... [11:17:21] Evrifaessa: the script will report that, and I'll say so on the task :-) [11:17:22] (03CR) 10Volans: [C: 04-2] "Need to wait a long-running cookbook on cumin1001 to finish before installing spicerack 0.0.44 and deploying this" [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [11:17:33] (still running through) [11:18:02] Urbanecm: It is live, I'll close task as resolved now. [11:18:07] cool [11:18:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:20:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:22:57] !log imported php-defaults, php-excimer, php-luasandbox, php-geoip to component/icu63 T264991 [11:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:03] T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 [11:24:42] Urbanecm: did the script end yet? 
[11:24:46] no [11:24:46] (CR) Volans: [C: +1] "Thanks for the fixes, LGTM, just few more nits that came to mind, if you are in for it, all totally optional." (4 comments) [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel) [11:24:51] still running [11:25:00] Operations, GrowthExperiments-NewcomerTasks, Product-Infrastructure-Team-Backlog, serviceops, Growth-Team (Current Sprint): Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (kostajh) [11:31:19] Urbanecm: finished yet? [11:31:33] Evrifaessa: no :-). I'll !log once it's completed :) [11:52:30] (PS1) Ayounsi: Add switch interface support to decom script [cookbooks] - https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [11:53:43] (CR) jerkins-bot: [V: -1] Add switch interface support to decom script [cookbooks] - https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: Ayounsi) [11:59:15] (CR) Gehel: Introduce an interface for progress bars. (3 comments) [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel) [11:59:32] (CR) Gehel: Introduce an interface for progress bars. (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel) [12:00:00] (PS14) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) [12:01:42] Urbanecm: finished yet?
[12:04:39] (PS9) Volans: cookbook API: add class API [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) [12:05:00] (CR) Volans: "Replies inline" (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans) [12:07:05] (PS2) Ayounsi: Add switch interface support to decom script [cookbooks] - https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [12:08:13] (CR) jerkins-bot: [V: -1] Add switch interface support to decom script [cookbooks] - https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: Ayounsi) [12:08:15] (CR) Volans: [C: +1] "LGTM!" [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel) [12:08:31] (CR) Gehel: [C: +2] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel) [12:14:22] Urbanecm: haha! no worries, I don't know how I got voiced tbh [12:14:33] @seen Urbanecm [12:14:33] Evrifaessa: Urbanecm is in here, right now [12:14:44] but yes definitely not a bot, that's what a bot would say [12:20:55] !log imported dh-php, php-acpu, php-imagick to component/icu63 T264991 [12:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:03] T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 [12:27:31] Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (fgiunchedi) I spotted a problem with `sdd` in dmesg too, perhaps that disk isn't healthy ` [100075.068371] sd 0:1:0:3: [sdd] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [100075.068375] sd 0:1...
[12:30:32] (CR) Alexandros Kosiaris: [C: +1] redis::instance: raise the file limit to match maxclients [puppet] - https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910) (owner: Giuseppe Lavagetto) [12:42:24] (PS2) Giuseppe Lavagetto: redis::instance: switch to use systemd::service [puppet] - https://gerrit.wikimedia.org/r/632661 [12:42:34] Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) > no sure why but some host take a looong time While debugging this i noticed that we where receiving a lot of messages like the following (avalible... [12:42:50] Operations, ops-codfw, serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (Papaul) p:Triage→Medium [12:43:03] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front... [12:43:13] Operations, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (Papaul) p:Triage→Medium [12:43:21] Urbanecm: you here?
[12:43:37] PROBLEM - Disk space on ms-be2036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [12:44:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] redis::instance: switch to use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/632661 (owner: 10Giuseppe Lavagetto) [12:44:07] yeah that's me ^ (ms-be2036) [12:44:27] (03Abandoned) 10JMeybohm: api-gateway: use default envoy 1.15.1 image [deployment-charts] - 10https://gerrit.wikimedia.org/r/632483 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [12:46:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) @Marostegui since the server is under warranty, it is best to use a disk that is under warranty as well. [12:46:50] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) @Papaul sounds good, so maybe let's remove the old disk, give it 5 minutes, and then place the new one in? [12:48:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) Will do that once on site [12:49:27] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) Going to depool the host just in case, thanks! 
[12:49:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2026 for on-site maintenance T263837 ', diff saved to https://phabricator.wikimedia.org/P12975 and previous config saved to /var/cache/conftool/dbconfig/20201013-124940-marostegui.json [12:49:45] (03PS1) 10Giuseppe Lavagetto: profile::redis::multidc_instance: correct reference to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/633735 [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:51] T263837: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 [12:49:59] !log End of `urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=trwiki --fix` # T265336 [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:04] T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336 [12:50:44] !log urbanecm@mwmaint2001:~$ mwscript namespaceDupes.php --wiki=trwiki --add-prefix=FIXME --fix # T265336 [12:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:04] (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [12:51:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::redis::multidc_instance: correct reference to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/633735 (owner: 10Giuseppe Lavagetto) [12:55:18] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10elukey) >>! In T264472#6525167, @Kormat wrote: > I've created your kerberos principal earlier today, you should receive an email telling you how to... 
[12:57:41] 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [12:58:16] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [12:58:21] PROBLEM - Check size of conntrack table on cescout1001 is CRITICAL: CRITICAL: nf_conntrack is 99 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:58:27] 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Open→03Stalled All of eqiad is now done. The remaining hosts/sections in codfw will be done after the dc swit... [12:59:32] ^^^ conntrack error is me [12:59:43] oh great, thanks, was about to look [13:01:11] sukhe: ok i have finished playing around on cescout my nmap scan just finished so that alert should clear soon. ill remove nmap and let you know if i need it again. thanks :) [13:01:45] RECOVERY - Check size of conntrack table on cescout1001 is OK: OK: nf_conntrack is 76 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:02:20] (03CR) 10Mforns: "LGTM! Good to merge on our side!" [puppet] - 10https://gerrit.wikimedia.org/r/633510 (https://phabricator.wikimedia.org/T254332) (owner: 10Ayounsi) [13:03:20] (03PS2) 10Giuseppe Lavagetto: redis::instance: raise the file limit to match maxclients [puppet] - 10https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910) [13:04:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) @ayounsi > Let me know when it's fine to merge the relevant change (for src_net + dst_net at least). Please, merge... 
[13:05:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] redis::instance: raise the file limit to match maxclients [puppet] - 10https://gerrit.wikimedia.org/r/632662 (https://phabricator.wikimedia.org/T263910) (owner: 10Giuseppe Lavagetto) [13:05:57] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:07:20] there's a huge spike on fatals [13:07:26] jbond42: np :) [13:07:58] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [13:08:08] !log imported php-mailparse, php-mongodb, php-msgpack to component/icu63 T264991 [13:08:09] 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Stalled→03Resolved [13:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:14] T264991: Upgrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 [13:08:38] 10Operations, 10DBA, 10Data-Persistence, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Resolved→03Stalled [13:08:41] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [13:09:21] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:11:16] (03PS1) 10Kormat: dbutil: Allow .csv path to be overridden in env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/633736 
[13:15:01] !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=trwiki --add-prefix=BROKEN --fix # T265336 [13:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:08] T265336: Add namespace aliases to Turkish Wikipedia - https://phabricator.wikimedia.org/T265336 [13:16:50] (PS4) Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) [13:17:11] (CR) jerkins-bot: [V: -1] swift: change ownership depending on mountpoint status [puppet] - https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) (owner: Filippo Giunchedi) [13:19:07] (PS5) Filippo Giunchedi: swift: change ownership depending on mountpoint status [puppet] - https://gerrit.wikimedia.org/r/516615 (https://phabricator.wikimedia.org/T225613) [13:22:59] (CR) Kormat: [C: +2] dbutil: Allow .csv path to be overridden in env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633736 (owner: Kormat) [13:24:26] (Merged) jenkins-bot: dbutil: Allow .csv path to be overridden in env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/633736 (owner: Kormat) [13:29:53] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:01] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (MSantos) @CDanis and @Dzahn as per T261424#6538173, is there anything else to be done for the 3rd party block in the tr...
[13:33:09] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% [13:33:15] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:22] (03CR) 10Ayounsi: [C: 03+2] Nfacctd, add src_net, dst_net [puppet] - 10https://gerrit.wikimedia.org/r/633510 (https://phabricator.wikimedia.org/T254332) (owner: 10Ayounsi) [13:35:28] (03CR) 10JMeybohm: [C: 03+2] service_proxy: add node.js keepalive to push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/633199 (owner: 10JMeybohm) [13:37:58] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10ayounsi) 05Resolved→03Open This has been alerting again this time for scs-c1-codfw. See https://librenms.wikimedia.org/graphs/device=170/type=device_processor/from=1601991300/legend=yes/pop... [13:39:10] 08Warning Alert for device scs-c1-codfw.mgmt.codfw.wmnet - Processor usage over 85% got acknowledged [13:40:25] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10fgiunchedi) Plus these usb recurring messages in dmesg ` [105275.802560] usb 3-3: USB disconnect, device number 13 [105276.254453] usb 3-3: new high-speed USB device number 14 using xhci_hcd [105276.394569] us... [13:40:34] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) Merged, note that it's not in a CIDR notation, so `src_mask` + `dst_mask` would be needed to generate the CIDR form. 
[13:42:55] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [13:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:39] RECOVERY - Disk space on ms-be2036 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [13:46:37] (03PS1) 10Ayounsi: Nfacct: add src_mask + dst_mask [puppet] - 10https://gerrit.wikimedia.org/r/633737 (https://phabricator.wikimedia.org/T254332) [13:55:58] (03PS1) 10Vgutierrez: vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/633739 (https://phabricator.wikimedia.org/T258405) [13:57:27] (03PS1) 10Andrew Bogott: clouddvirt102[1-9]: apply libvirt-backy-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/633741 (https://phabricator.wikimedia.org/T260692) [13:59:31] (03PS2) 10Andrew Bogott: clouddvirt102[1-9]: apply libvirt-backy-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/633741 (https://phabricator.wikimedia.org/T260692) [14:07:58] (03CR) 10Andrew Bogott: [C: 03+2] clouddvirt102[1-9]: apply libvirt-backy-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/633741 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [14:12:52] (03PS1) 10Andrew Bogott: Revert "nova-fullstack monitoring: turn on debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/633742 (https://phabricator.wikimedia.org/T265140) [14:13:55] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova-fullstack monitoring: turn on debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/633742 (https://phabricator.wikimedia.org/T265140) (owner: 10Andrew Bogott) [14:16:10] (03PS1) 10Urbanecm: Add setmentor to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633743 [14:16:23] jouncebot: now [14:16:23] No deployments scheduled for 
the next 1 hour(s) and 43 minute(s) [14:16:32] (03CR) 10Urbanecm: [C: 03+2] Add setmentor to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633743 (owner: 10Urbanecm) [14:17:14] (03Merged) 10jenkins-bot: Add setmentor to wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633743 (owner: 10Urbanecm) [14:18:34] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 5b28fd685b9cb8d8e93650b5d02bc41b81d0883c: Add setmentor to wgAvailableRights (duration: 00m 59s) [14:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:14] (03PS1) 10Andrew Bogott: cloudvirt1022: move to virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/633744 (https://phabricator.wikimedia.org/T260692) [14:23:17] (03PS1) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) [14:25:17] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={3,4,5} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=l [14:25:17] topic=All&var-consumer_group=All [14:26:52] (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [14:27:03] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:37:15] RECOVERY - Disk space on sretest1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [14:39:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:36] (03CR) 10CDanis: VCL: A heavy hammer for dire circumstances. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis) [14:40:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:40:58] (03PS4) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 [14:41:25] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:42:00] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1022: move to virt_ceph_and_backy [puppet] - 10https://gerrit.wikimedia.org/r/633744 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [14:43:05] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={1,3,4,5} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster [14:43:05] r-topic=All&var-consumer_group=All [14:46:20] (03CR) 10Ema: VCL: A heavy hammer for dire circumstances. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis) [14:46:21] (03CR) 10Ema: [C: 03+1] VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis) [14:51:33] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:55:27] (03PS1) 10Kormat: mariadb: Remove unused hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/633768 (https://phabricator.wikimedia.org/T256972) [14:58:59] (03PS2) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511 [14:59:24] (03CR) 10Kormat: "PCC is no-op: https://puppet-compiler.wmflabs.org/compiler1003/25827/" [puppet] - 10https://gerrit.wikimedia.org/r/633768 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [15:02:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) new disk in place Status Name State Slot Number Size Security Status Bus Protocol Media Type Hot Spare Remaining Rated Write Endurance Physical Disk 0:1:0 On... [15:02:37] !log bounce logstash on logstash1007, GC death [15:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:57] (03CR) 10Andrew Bogott: "I'm confident about the search path thing being safe, less confident about the domain bit" [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott) [15:04:32] (03PS2) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) [15:04:48] (03PS1) 10Kormat: [WIP] mariadb: Convert role::mariadb::core to profile. 
[puppet] - 10https://gerrit.wikimedia.org/r/633769 (https://phabricator.wikimedia.org/T256972) [15:05:15] (03CR) 10Volans: "Minor nits inline, LGMT" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [15:05:36] (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [15:05:51] marostegui: hello can you also take the possessed server (db2125) down so i can replace the CPU [15:06:45] papaul: hi, i'll do it. [15:07:02] thanks so much kormat [15:07:06] <3 [15:07:07] kormat: thanks [15:08:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) ` pt1979@es2026:~$ sudo megacli -PDRbld -ShowProg -physdrv[32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 3% in 6 Minutes. [15:08:23] PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/cd7dd8b4f8190a4d3d7e08b4304dcd82cdfd76206313416e6ad144eb359e7ea9/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [15:08:29] papaul: poweroff is running now [15:08:36] kormat: cool thanks [15:10:56] (03PS5) 10CDanis: VCL: A heavy hammer for dire circumstances. 
[puppet] - 10https://gerrit.wikimedia.org/r/631848 [15:11:32] (03CR) 10Volans: "LGTM, one suggestion inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey) [15:14:29] PROBLEM - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:05] ^ acking [15:15:27] ACKNOWLEDGEMENT - Host db2125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Kormat maintenance [15:16:36] (03CR) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [15:17:01] (03CR) 10BryanDavis: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott) [15:20:57] (03PS6) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 [15:24:58] (03CR) 10CDanis: "With the new flag disabled, text and upload:" [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis) [15:25:21] (03PS7) 10CDanis: VCL: A heavy hammer for dire circumstances. 
[puppet] - 10https://gerrit.wikimedia.org/r/631848 [15:26:12] (03PS1) 10Andrew Bogott: cloudvirt1021: add backy support [puppet] - 10https://gerrit.wikimedia.org/r/633771 (https://phabricator.wikimedia.org/T260692) [15:26:45] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1021: add backy support [puppet] - 10https://gerrit.wikimedia.org/r/633771 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [15:27:42] (03CR) 10Arturo Borrero Gonzalez: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott) [15:29:09] PROBLEM - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:29:10] ACKNOWLEDGEMENT - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T265368 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:29:17] 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T265368 (10ops-monitoring-bot) [15:30:00] Same as T263837? 
[15:30:00] T263837: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 [15:30:37] (03CR) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott) [15:30:41] (03PS3) 10Andrew Bogott: cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511 [15:33:51] 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T265368 (10Marostegui) 05Open→03Declined Handled at https://phabricator.wikimedia.org/T263837 [15:36:06] (03PS2) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 [15:36:08] (03PS3) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) [15:36:35] RECOVERY - Host db2125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.99 ms [15:37:14] (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey) [15:37:38] (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [15:38:58] (03PS3) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 [15:39:00] (03PS4) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) [15:39:13] the -1s are due to another cr waiting to be merged, sorry for the spam [15:40:04] (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.change-distro-from-cdh: allow to select workers/journal 
[cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey) [15:40:36] (03CR) 10jerkins-bot: [V: 04-1] sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [15:40:49] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) Both CPU replaced, servers is back up [15:43:18] (03PS4) 10Volans: tests: import dns from new wmflib package [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [15:43:38] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: update resolv.conf and associated domains for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/631511 (owner: 10Andrew Bogott) [15:44:59] (03CR) 10jerkins-bot: [V: 04-1] tests: import dns from new wmflib package [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:46:35] (03PS1) 10Andrew Bogott: WMCS nfs: remove the last mounts for wikidata-dev [puppet] - 10https://gerrit.wikimedia.org/r/633773 (https://phabricator.wikimedia.org/T208416) [15:47:15] (03CR) 10Andrew Bogott: [C: 03+2] WMCS nfs: remove the last mounts for wikidata-dev [puppet] - 10https://gerrit.wikimedia.org/r/633773 (https://phabricator.wikimedia.org/T208416) (owner: 10Andrew Bogott) [15:50:08] (03PS5) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [15:50:10] (03PS1) 10Volans: Temporary limit spicerack dependency [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) [15:52:00] (03CR) 10CRusnov: [C: 03+1] "seems legitimate" [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:52:02] (03CR) 10jerkins-bot: 
[V: 04-1] sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:52:33] (03CR) 10Volans: [C: 03+2] Temporary limit spicerack dependency [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:53:09] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh network configuration [puppet] - 10https://gerrit.wikimedia.org/r/633775 (https://phabricator.wikimedia.org/T261724) [15:53:38] (03Merged) 10jenkins-bot: Temporary limit spicerack dependency [cookbooks] - 10https://gerrit.wikimedia.org/r/633774 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [15:53:59] (03PS6) 10Volans: sre.hosts.decommission: import from new library [cookbooks] - 10https://gerrit.wikimedia.org/r/629692 (https://phabricator.wikimedia.org/T257905) [15:55:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh network configuration [puppet] - 10https://gerrit.wikimedia.org/r/633775 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:55:43] 10Operations, 10Phatality, 10observability, 10Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10mmodell) @fgiunchedi it's unlikely that it will work in kibana 7 without significant changes. Kibana's... [15:56:11] !log power down ms-be2036 for maintenance [15:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:52] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [15:59:13] (03CR) 10Vgutierrez: [C: 03+1] wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:00:04] jbond42 and cdanis: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet request window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1600). [16:07:03] (03PS2) 10Ssingh: wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) [16:07:37] tgr_: just setting up over here to perform the wmf.11 promotions [16:07:56] (03CR) 10Ssingh: "Patch rebased, no other changes." [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:08:00] (03CR) 10Ssingh: [C: 03+2] wikidough: enable OCSP stapling in dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/632735 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:09:16] marxarelli: ack [16:13:20] longma: ^ [16:18:00] (03CR) 10Ottomata: [C: 03+1] Eventstreams: expose revision-visibility-change events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) (owner: 10Ppchelko) [16:18:52] (03PS4) 10Elukey: sre.hadoop.change-distro-from-cdh: allow to select workers/journal [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 [16:18:54] (03PS5) 10Elukey: sre.hadoop.reboot-workers: allow to limit workers to reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) [16:23:33] (03PS1) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 [16:24:15] (03CR) 10jerkins-bot: [V: 04-1] wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond) [16:24:26] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Thank you Papaul, I will start repooling the host tomorrow and see how not goes with load [16:25:57] tgr_, longma: sorry for the delay. thought i'd email the lists with the latest deployment plan (cc thcipriani). continuing... 
[16:26:27] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@77febb6]: airflow: parameterize active mediawiki dc [16:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:17] tgr_: were your wmf.11 sync'd? [16:27:24] wmf.11 *backports* [16:31:57] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@77febb6]: airflow: parameterize active mediawiki dc (duration: 05m 29s) [16:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:32] marxarelli: yeah [16:34:14] k. just double checking. looks good from here. rolling [16:35:03] (03PS1) 10Ssingh: dnsdist: update permissions for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/633781 [16:35:17] oh fun. looks like deploy-promote is confused [16:36:27] (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633784 [16:36:29] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633784 (owner: 10Dduvall) [16:36:51] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/25836/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633781 (owner: 10Ssingh) [16:36:57] (03CR) 10Ssingh: [C: 03+2] dnsdist: update permissions for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/633781 (owner: 10Ssingh) [16:37:08] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633784 (owner: 10Dduvall) [16:38:15] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) @fgiunchedi ` Embedded Flash/SD-CARD Controller firmware revision 2.10.00 Embedded media manager failed media attach [16:39:12] (03Abandoned) 10Volans: sre.hosts.downtime: convert to class-based API [cookbooks] - 10https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: 
10Volans) [16:39:31] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.11 [16:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:47] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) Upgrade ILO from 2.50 to 2.74 [16:41:11] (03PS1) 10Gergő Tisza: GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) [16:44:23] (03PS2) 10Gergő Tisza: GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) [16:50:39] (03PS2) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 [16:51:25] (03CR) 10jerkins-bot: [V: 04-1] wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond) [16:52:01] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10nettrom_WMF) >>! In T252391#6536174, @kostajh wrote: > Hmm, I spoke too soon. We rely on the `wgWMEUnderstandingFirstDay` bei... [16:52:59] (03PS3) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 [16:53:44] (03CR) 10jerkins-bot: [V: 04-1] wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond) [16:55:17] tgr_: so far so good. i'm thinking of rolling wmf.11 to group1 in about 30 min. does that work for you? [16:55:22] longma: ^ [16:55:47] (03PS4) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 [16:56:17] i mean, so far so good in terms of logging. 
if there's more to do to verify the session issue i can hold off [16:56:53] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10JMinor) Yes, I think everyone who requested to be added to the allow list has been added. There were a couple questions on the mailing lis... [16:57:43] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/633766 (https://phabricator.wikimedia.org/T255138) (owner: 10Elukey) [16:58:38] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/633550 (owner: 10Elukey) [17:00:04] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1700). [17:00:21] (03PS5) 10Jbond: wmflib: new fact listening ports [puppet] - 10https://gerrit.wikimedia.org/r/633779 [17:00:45] (03CR) 10Jbond: wmflib: new fact listening ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633779 (owner: 10Jbond) [17:01:44] marxarelli: looks fine to me [17:02:24] tgr_: ack. thanks! [17:05:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/633768 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [17:07:24] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 129.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [17:09:36] (03CR) 10Ppchelko: [C: 03+2] Eventstreams: expose revision-visibility-change events. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) (owner: 10Ppchelko) [17:11:43] (03Merged) 10jenkins-bot: Eventstreams: expose revision-visibility-change events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) (owner: 10Ppchelko) [17:11:54] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10wiki_willy) a:05RKemper→03Cmjohnson [17:15:51] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) return tracking information {F32383537} [17:15:52] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [17:15:52] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [17:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:56] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [17:16:56] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [17:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:08] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [17:17:32] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Tgr) >>! 
In T252391#6536174, @kostajh wrote: > I think instead of checking to see if `wgWMEUnderstandingFirstDay` is true, we... [17:18:28] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [17:18:28] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [17:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:48] (03CR) 10Razzi: [C: 03+2] geoip: archive MaxMind database to hdfs only [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [17:30:37] !log 1.36.0-wmf.11 promoted to group0. no new errors (T263177). preparing to promote to group1 [17:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:43] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [17:31:23] (03PS1) 10Dduvall: group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633793 [17:31:25] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633793 (owner: 10Dduvall) [17:31:48] tgr_, longma: ^ rolling to group1 [17:32:06] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633793 (owner: 10Dduvall) [17:32:15] 👀 [17:32:29] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:32:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:26] marxarelli should I turn on xwikimediadebug and make some test edits in case the issue comes back? 
[17:34:12] DannyS712: that'd be great, yeah [17:34:25] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.11 [17:34:27] with verbose logging, or no? [17:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:42] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:24] ^ tgr_ re: debug logging. would that be helpful? [17:35:34] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.11 (duration: 01m 07s) [17:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:39] DannyS712: i think whatever you can do to try and repro the issue would be helpful. tgr_ has set up some additional logging on the session channel but i don't see how additional debug logging would hurt [17:40:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [17:40:24] I turned on verbose and am just doing things on meta (recent changes patrolling) which should hopefully trigger whatever code paths caused it last time [17:41:50] PROBLEM - Host ms-be2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:47] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10CDanis) >>! In T261424#6539494, @JMinor wrote: > Yes, I think everyone who requested to be added to the allow list has been added. There w... [17:42:58] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10CDanis) a:03CDanis [17:47:01] (03CR) 10Bstorm: [C: 04-1] "The neovim add is fine. We need to add a bit more to make the others work." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [17:47:36] RECOVERY - Host ms-be2036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [17:47:41] !log 1.36.0-wmf.13 branched at a6be801fc6331a6a6b96f02f368750200d50ab09 for T263179 [17:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:48] T263179: 1.36.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T263179 [17:48:44] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [17:49:46] RECOVERY - Device not healthy -SMART- on mw2279 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2279&var-datasource=codfw+prometheus/ops [17:51:44] (03CR) 10Bstorm: [C: 04-1] "I see the problem. It's available in buster, but it isn't in stretch. Bastions and gridengine are still on Buster for now." [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [17:52:51] (03PS1) 10Andrew Bogott: site.pp: rearrange slightly to clarify different cloudvirt roles [puppet] - 10https://gerrit.wikimedia.org/r/633798 [17:54:10] (03CR) 10Dduvall: [C: 03+2] Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633598 (https://phabricator.wikimedia.org/T263179) (owner: 10TrainBranchBot) [17:55:33] marxarelli: TBH the chances of it happening to the same user twice in a row are slim. If it was specifically related to X-Wikimedia-Debug, maybe, but DannyS712 said he didn't use XWD before the incident. 
[17:56:14] I didn't have it turned on, but I had used it before [17:56:14] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) The Flash/SD-CARD problem was fixed by formatting the NAND and draining the power ` Embedded Flash/SD-CARD Controller firmware revision 2.10.00 [17:56:18] but yeah, its unlikely [17:56:31] (03CR) 10Andrew Bogott: [C: 03+2] site.pp: rearrange slightly to clarify different cloudvirt roles [puppet] - 10https://gerrit.wikimedia.org/r/633798 (owner: 10Andrew Bogott) [17:56:37] ok [17:56:48] 10Operations, 10ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T265208 (10Papaul) @fgiunchedi looks like icinga is happy now ` MD RAID View Extra Service Notes OK 2020-10-13 17:51:33 0d 0h 3m 51s 1/3 OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:57:37] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Papaul) @jijiki I will request a disk replacement [17:58:34] tgr_: i'm prepping wmf.13 to go out during the normal window. i would like to get wmf.11 to all wikis before then, but you mentioned having a meeting to attend in 30 min. if i rolled wmf.11 to all wikis now, would that work for you or would you rather i wait until after your meeting? [17:59:02] i haven't seen anything of concern after rolling to group1 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1800) [18:00:13] (03CR) 10Muehlenhoff: Add neovim, fd and ripgrep to toolforge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [18:00:54] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) >>! In T238036#6538876, @ayounsi wrote: > This has been alerting again this time for scs-c1-codfw. 
See https://librenms.wikimedia.org/graphs/device=170/type=device_processor/from=16019913... [18:01:47] !log scs-c1-codfw firmware update via T238036 [18:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:54] T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 [18:02:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:03:10] 08̶W̶a̶r̶n̶i̶n̶g Device scs-c1-codfw.mgmt.codfw.wmnet recovered from Processor usage over 85% [18:03:54] doesnt count its only recovered cuz its rebooting ;D [18:04:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:06:27] (03PS5) 10Dzahn: wikistats: rm php7.0 pre-buster support, make PHP version parameter [puppet] - 10https://gerrit.wikimedia.org/r/633286 [18:08:32] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:42] !log scs-c1-codfw mgmt firmware updated, updating scs-a1-codfw T238036 [18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:47] T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 [18:10:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25838/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633286 (owner: 10Dzahn) [18:12:23] (03CR) 10Bstorm: 
[C: 04-1] Add neovim, fd and ripgrep to toolforge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [18:12:28] (03PS1) 10Alexandros Kosiaris: admin: Update some of my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/633802 [18:15:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Update some of my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/633802 (owner: 10Alexandros Kosiaris) [18:19:12] marxarelli: sorry, missed the ping. It would work; I don't think I can do much other than sit and wait to see if someone reports an issue, anyway. [18:19:36] tgr_: np. i'll roll it then [18:19:38] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.13 [core] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633598 (https://phabricator.wikimedia.org/T263179) (owner: 10TrainBranchBot) [18:20:31] (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633804 [18:20:33] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633804 (owner: 10Dduvall) [18:20:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:07] !log 1.36.0-wmf.11 promoted to group1. no new errors (T263177). promoting to all wikis [18:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:13] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [18:21:15] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633804 (owner: 10Dduvall) [18:21:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:10] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.11 [18:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:40] PROBLEM - Long running screen/tmux on an-launcher1002 is CRITICAL: CRIT: Long running SCREEN process. (user: milimetric PID: 22072, 1741990s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:24:56] ah, my apologies [18:25:30] ok, closed, sorry again [18:25:34] what a nifty tool :) [18:26:36] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [18:37:36] (03CR) 10Dzahn: calico: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633033 (owner: 10Dzahn) [18:39:10] RECOVERY - MegaRAID on es2026 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:40:42] (03PS2) 10Dzahn: calico: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633033 [18:41:34] milimetric: hehe, thanks. (https://xkcd.com/838/) [18:42:28] it's possible to whitelist hosts/roles fwiw [18:44:26] lol, yeah, no I belong on the naughty list this time. If I ever submit something that long running, it better be running on a distributed system and not a screen :) [18:45:21] ok;) [18:52:03] (03CR) 10Dzahn: [V: 03+1] "role::calico::builder not currently on any instances. 
would be prefix "calico" in packaging project in cloud: https://openstack-browser.to" [puppet] - 10https://gerrit.wikimedia.org/r/633033 (owner: 10Dzahn) [18:52:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [18:53:24] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.6 (duration: 13m 00s) [18:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:18] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.8 (duration: 02m 10s) [18:56:18] (03CR) 10Dzahn: [V: 03+1] "relforge: https://puppet-compiler.wmflabs.org/compiler1003/25841/" [puppet] - 10https://gerrit.wikimedia.org/r/633022 (owner: 10Dzahn) [18:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:15] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.9 (duration: 01m 56s) [18:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:53] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/25843/lvs2009.codfw.wmnet/change.lvs2009.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/633026 (owner: 10Dzahn) [18:58:55] (03CR) 10Ppchelko: [C: 03+1] api-gateway: more instances in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/633559 (owner: 10Hnowlan) [18:59:01] (03PS1) 10Dduvall: testwikis wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633810 [18:59:03] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633810 (owner: 10Dduvall) [18:59:41] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.13 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/633810 (owner: 10Dduvall) [19:00:03] !log dduvall@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.13 [19:00:04] marxarelli and longma: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T1900). [19:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:37] (03PS2) 10Dzahn: pybal: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633026 [19:02:24] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) I've successfully upgraded the scs firmware fleetwide, with the exception of two devices: * [[ https://netbox.wikimedia.org/dcim/devices/1327/ | scs-a8-eqiad ]] - old model CM4148, needs... [19:02:34] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25844/" [puppet] - 10https://gerrit.wikimedia.org/r/633026 (owner: 10Dzahn) [19:11:22] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) [19:11:59] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) Email sent to Julianne to re-check the invoice data pulled for asw2-d4-eqiad Thanks, Willy [19:12:34] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/25845/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:23:31] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! In T252391#6539461, @nettrom_WMF wrote: >>>! 
In T252391#6536174, @kostajh wrote: >> Hmm, I spoke too soon. We re... [19:23:45] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10jijiki) Thank you! [19:24:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:16] (03PS3) 10Kosta Harlan: Disable wgWMEUnderstandingFirstDay (EditorJourney) logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) [19:24:22] (03PS2) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [19:24:52] (03CR) 10Kosta Harlan: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [19:26:09] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza) [19:26:53] (03CR) 10CDanis: [C: 03+2] VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 (owner: 10CDanis) [19:27:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:31:59] (03CR) 10Gergő Tisza: [C: 03+1] Disable wgWMEUnderstandingFirstDay (EditorJourney) logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [19:32:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:20] (03CR) 10Gergő Tisza: [C: 04-1] "Roan pointed out that this wouldn't actually undo the patch since it removes A from the valid variant list, which is not configurable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza) [19:33:40] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:34:23] (03PS1) 10Andrew Bogott: Update profile::openstack::base::nova::instance_dev for several cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/633816 [19:34:25] (03PS1) 10Andrew Bogott: Remove hiera for labvirt1010/1011 [puppet] - 10https://gerrit.wikimedia.org/r/633817 [19:35:20] (03CR) 10Andrew Bogott: [C: 03+2] Remove hiera for labvirt1010/1011 [puppet] - 10https://gerrit.wikimedia.org/r/633817 (owner: 10Andrew Bogott) [19:35:25] (03CR) 10Andrew Bogott: [C: 03+2] Update profile::openstack::base::nova::instance_dev for several cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/633816 (owner: 10Andrew Bogott) [19:35:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:38:49] (03CR) 10Kosta Harlan: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza) [19:39:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) From the parent task: >>! In T216195#6524284, @ayounsi wrote: > Note that now racks `C8` and `D5` are dedicated to WMCS s... [19:40:37] (03CR) 10Gergő Tisza: [C: 04-1] "> Patch Set 2: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza) [19:40:40] !log dduvall@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.13 (duration: 40m 51s) [19:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:57] marxarelli: lovely :) [19:43:25] (03Abandoned) 10Gergő Tisza: GrowthExperiments: Keep users with no explicit variant on variant A [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633786 (https://phabricator.wikimedia.org/T265372) (owner: 10Gergő Tisza) [19:44:55] hashar: so far so good :) [19:45:09] 10Operations, 10ops-eqiad, 10Discovery-Search: Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10dcausse) happened again today: ` [Tue Oct 13 18:14:55 2020] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [Tue Oct 13 1... 
[19:45:28] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [19:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:17] testwiki and logs look ok to me. rolling wmf.13 to group0, cc: longma [19:52:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:15] (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633822 [19:52:17] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633822 (owner: 10Dduvall) [19:52:54] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633822 (owner: 10Dduvall) [19:54:20] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.13 [19:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:37] (03CR) 10Dzahn: [V: 04-1] "works on most prod hosts but there seems to be some special case with cloud/labweb" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:01:28] (03PS3) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [20:02:24] (03CR) 10Dzahn: "it's because hieradata/cloud/eqiad1/deployment-prep/common.yaml does not have a real FQDN for memcached_servers but it should be. 
fixing i" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:03:19] !log add elastic2029-production-search-psi-codfw to cluster.routing.allocation.exclude._name to drain active shards, instance currently in gc hell [20:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:36] !log 1.36.0-wmf.13 promoted to group0. no new or concerning errors or changes in error rates (T263179) [20:06:40] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) @ema are you the sole author, or would you like additional authors added? [20:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:42] T263179: 1.36.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T263179 [20:07:03] logs look alright to me as well [20:08:16] * marxarelli nods [20:08:20] until tomorrow [20:11:28] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [20:12:01] (03PS4) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [20:14:05] !log restart production-search-psi-codfw on elastic2029 to reset any wonkiness from gc hell [20:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:58] !log unban elastic2029 from production-search-psi-codfw [20:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:51] (03PS1) 10Catrope: Revert
"Make variant D the default, and remove variant A" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633757 (https://phabricator.wikimedia.org/T265372) [20:27:12] (03PS1) 10Dzahn: netmon: remove stretch PHP 7.2 support [puppet] - 10https://gerrit.wikimedia.org/r/633824 [20:27:14] (03PS1) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 [20:27:42] (03CR) 10jerkins-bot: [V: 04-1] netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn) [20:30:04] (03PS2) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 [20:31:05] (03CR) 10jerkins-bot: [V: 04-1] netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn) [20:32:57] (03PS3) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 [20:34:04] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10srodlund) @ema For the main image, I went with the earth from space: https://commons.wikimedia.org/wiki/File:North_America_from_low_orbiting_... 
[20:39:19] (03PS1) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 [20:39:50] (03CR) 10jerkins-bot: [V: 04-1] netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn) [20:42:17] (03PS2) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 [20:43:32] (03CR) 10jerkins-bot: [V: 04-1] netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn) [20:44:26] !log bast1002 - apt-get remove nmap (it can be used on netmon hosts and was not consistent with other bast hosts) [20:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:58] !log bast1002 - apt-get autoremove - cleans up golang and ruby packages [20:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:42] (03PS3) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 [20:50:24] (03CR) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:53:12] (03PS5) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [20:58:05] (03PS6) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [20:59:25] (03CR) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) 
(owner: 10Dzahn) [21:02:26] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) p:05Triage→03Medium [21:02:54] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) [21:03:13] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) [21:03:16] (03PS1) 10Andrew Bogott: wmcs VM backups: add two more backup hosts, increase days to 7 [puppet] - 10https://gerrit.wikimedia.org/r/633829 (https://phabricator.wikimedia.org/T260692) [21:03:52] (03CR) 10Dzahn: [C: 04-1] "" Cannot reassign variable '$java_home'" ? https://puppet-compiler.wmflabs.org/compiler1001/25850/gerrit1001.wikimedia.org/change.gerrit1" [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [21:05:52] (03PS2) 10Andrew Bogott: wmcs VM backups: add two more backup hosts, increase days to 7 [puppet] - 10https://gerrit.wikimedia.org/r/633829 (https://phabricator.wikimedia.org/T260692) [21:06:11] (03CR) 10Dzahn: [V: 03+1] "noop on everything including cloud for C:profile::mediawiki::common https://puppet-compiler.wmflabs.org/compiler1003/25849/" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:07:44] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:16] PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:09:06] interesting. is that being used as a test host right now? 
[21:09:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:07] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions daniel_zahn https://phabricator.wikimedia.org/T264698 https://wikitech.wikimedia.org/wiki/Application_servers [21:12:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:12:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:15] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Dzahn) ` 21:08 <+icinga-wm> PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions https://wikitech.wikimedia.... [21:16:11] !log icinga had gerrit health alert but did not notice an issue myself and was gone next check [21:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:09] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "result for profile::mediawiki::webserver: https://puppet-compiler.wmflabs.org/compiler1001/25851/" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:22:28] (03CR) 10Dzahn: "this is odd. 
Sometimes "optional parameter listed before required parameter" makes jerkins -1 but in other places I totally expected it to" [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn) [21:28:42] (03PS3) 10Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288 [21:29:44] (03CR) 10jerkins-bot: [V: 04-1] wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn) [21:30:04] (03CR) 10Andrew Bogott: [C: 03+2] wmcs VM backups: add two more backup hosts, increase days to 7 [puppet] - 10https://gerrit.wikimedia.org/r/633829 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [21:32:48] (03PS2) 10Dzahn: elasticsearch: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/633022 [21:34:49] (03PS4) 10Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288 [21:38:44] (03PS3) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) [21:41:26] 10Operations, 10ops-codfw, 10serviceops: Degraded RAID on mw2279 - https://phabricator.wikimedia.org/T264698 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1039679642. 
[21:50:59] (03PS1) 10DannyS712: Partially revert "[labs] Remove wmgMonologChannels override" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633761 [21:51:06] (03PS4) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) [21:51:22] (03PS1) 10Dzahn: docker::registry: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633835 [21:51:56] (03PS2) 10DannyS712: Partially revert "[labs] Remove wmgMonologChannels override" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633761 [21:55:24] (03PS1) 10Dzahn: docker: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633836 [22:04:24] (03PS1) 10Dzahn: ldap::client::labs: fix 'Unknown variable: '::restricted..' [puppet] - 10https://gerrit.wikimedia.org/r/633838 [22:05:35] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10RobH) [22:05:38] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10RobH) [22:07:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:08:01] (03CR) 10Dzahn: [C: 04-1] "The title 'et' has already been used in this resource expression" [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn) [22:08:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:01] (03PS5) 10Dzahn: wikistats: redo the way cronjobs are setup, add parameter to absent [puppet] - 10https://gerrit.wikimedia.org/r/633288 [22:10:08] (03CR) 10Dzahn: "nice.. so all this time a duplicate cron job that now shows up to this change :)" [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn) [22:23:08] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25859/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633288 (owner: 10Dzahn) [22:25:22] RECOVERY - Long running screen/tmux on an-launcher1002 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [22:38:15] (03PS1) 10Dzahn: wikistats (cloud): disable crons on one of 2 instances [puppet] - 10https://gerrit.wikimedia.org/r/633842 [22:39:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25860/" [puppet] - 10https://gerrit.wikimedia.org/r/633842 (owner: 10Dzahn) [22:41:12] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) [22:55:31] (03PS1) 10Catrope: Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214) [22:58:44] (03PS1) 10Dzahn: wikistats: allow to 'absent' import/dump crons as well (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/633845 [22:59:03] (03CR) 10jerkins-bot: [V: 04-1] wikistats: allow to 'absent' import/dump crons as well (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/633845 (owner: 10Dzahn) [22:59:50] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Neha Nair (nnair) - https://phabricator.wikimedia.org/T265428 (10drochford) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the 
Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201013T2300). [23:00:04] hmonroy, tgr, and RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:14] I'll deploy [23:00:31] Let's do it! [23:00:45] o/ [23:00:47] (03CR) 10Catrope: [C: 03+2] Revert "Make variant D the default, and remove variant A" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633757 (https://phabricator.wikimedia.org/T265372) (owner: 10Catrope) [23:00:57] (03PS3) 10Catrope: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [23:01:05] (03PS4) 10Catrope: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [23:01:24] (03CR) 10Catrope: [C: 03+2] Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [23:02:13] (03Merged) 10jenkins-bot: Enable watchlist expiry feature on four wikis from group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632779 (https://phabricator.wikimedia.org/T264780) (owner: 10HMonroy) [23:02:35] hmonroy: Your change is on mwdebug2001, please test [23:02:45] checking [23:05:52] RoanKattouw: Looks good! 
[23:06:40] (03PS2) 10Catrope: Disable event logging in MediaViewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza) [23:06:50] (03CR) 10Catrope: [C: 03+2] Disable event logging in MediaViewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza) [23:07:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable watchlist expiry on frwiki, fawiki, dewiki, cswiki (T264780) (duration: 01m 04s) [23:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:26] T264780: Watchlist Expiry: Release to group 2 pilot wikis [TUES, OCT 13] - https://phabricator.wikimedia.org/T264780 [23:07:29] hmonroy: And it's live! [23:07:48] RoanKattouw: Awesome! Thank you :) [23:07:54] (03Merged) 10jenkins-bot: Disable event logging in MediaViewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza) [23:09:10] tgr_: Your patch is on mwdebug2001, would you like to test it or should I just deploy it right away? [23:12:25] (03Merged) 10jenkins-bot: Revert "Make variant D the default, and remove variant A" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/633757 (https://phabricator.wikimedia.org/T265372) (owner: 10Catrope) [23:12:37] RoanKattouw: tested, thanks! 
[23:12:50] (03PS2) 10Catrope: Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214) [23:12:54] (03CR) 10Catrope: [C: 03+2] Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214) (owner: 10Catrope) [23:13:43] (03Merged) 10jenkins-bot: Rename GrowthExperiments helpdesk on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633843 (https://phabricator.wikimedia.org/T265214) (owner: 10Catrope) [23:14:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable event logging in MediaViewer (T260582) (duration: 01m 04s) [23:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:25] T260582: Migrate EventLogging MediaViewer data to Event Platform - https://phabricator.wikimedia.org/T260582 [23:18:39] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Rename GrowthExperiments help desk on ptwiki (T265214) (duration: 01m 04s) [23:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:47] T265214: Change Growth features parameters on Portuguese Wikipedia - https://phabricator.wikimedia.org/T265214 [23:22:40] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/GrowthExperiments/: Revert removal of variant A (T265372) (duration: 01m 04s) [23:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:46] T265372: Variant C/D: configuration control - https://phabricator.wikimedia.org/T265372 [23:26:00] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 76 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:37:18] RECOVERY - IPv6 ping to codfw on 
ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 568 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:40:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10Dwisehaupt) 05Invalid→03Resolved [23:43:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10Dwisehaupt) 05Resolved→03Invalid