[00:28:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:28:58] RECOVERY - snapshot of s2 in eqiad on alert1001 is OK: Last snapshot for s2 at eqiad (db1095.eqiad.wmnet:3312) taken on 2020-10-18 22:50:23 (845 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:32:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:10:32] (PS2) Huji: Change votewiki language temporarily to fa for fawiki elections [mediawiki-config] - https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: Majavah)
[02:04:24] (PS1) Andrew Bogott: Remove a stray, unused file [puppet] - https://gerrit.wikimedia.org/r/634836
[02:04:26] (PS1) Andrew Bogott: glance-api: replace 'default_store' with the more flexible 'glance_backends' [puppet] - https://gerrit.wikimedia.org/r/634837 (https://phabricator.wikimedia.org/T263461)
[02:19:38] (PS2) Andrew Bogott: glance-api: replace 'default_store' with the more flexible 'glance_backends' [puppet] - https://gerrit.wikimedia.org/r/634837 (https://phabricator.wikimedia.org/T263461)
[02:19:51] (Abandoned) Andrew Bogott: Remove a stray, unused file [puppet] - https://gerrit.wikimedia.org/r/634836 (owner: Andrew Bogott)
[02:23:31] (PS3) Andrew Bogott: glance-api: replace 'default_store' with the more flexible 'glance_backends' [puppet] - https://gerrit.wikimedia.org/r/634837 (https://phabricator.wikimedia.org/T263461)
[02:26:55] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1001/25966/" [puppet] - https://gerrit.wikimedia.org/r/634837
(https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott)
[03:31:09] (CR) Andrew Bogott: [C: +2] glance-api: replace 'default_store' with the more flexible 'glance_backends' [puppet] - https://gerrit.wikimedia.org/r/634837 (https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott)
[03:33:28] (PS1) Andrew Bogott: glance: disable the file backend for glance in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634839 (https://phabricator.wikimedia.org/T263461)
[03:37:17] (PS2) Andrew Bogott: glance: disable the file backend for glance in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634839 (https://phabricator.wikimedia.org/T263461)
[03:38:17] (CR) Andrew Bogott: [C: +2] glance: disable the file backend for glance in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634839 (https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott)
[04:07:29] (PS1) Andrew Bogott: glance-api: make active/active in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634840 (https://phabricator.wikimedia.org/T263461)
[04:08:30] (CR) jerkins-bot: [V: -1] glance-api: make active/active in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634840 (https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott)
[04:10:04] (PS2) Andrew Bogott: glance-api: make active/active in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634840 (https://phabricator.wikimedia.org/T263461)
[04:12:15] (PS1) Ppchelko: Add api.wikimedia.org to the list of allowed CORS origins [mediawiki-config] - https://gerrit.wikimedia.org/r/634841
[04:12:45] (CR) Ppchelko: "Echo notifications don't work without it."
[mediawiki-config] - https://gerrit.wikimedia.org/r/634841 (owner: Ppchelko)
[04:15:52] (PS3) Andrew Bogott: glance-api: make active/active in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/634840 (https://phabricator.wikimedia.org/T263461)
[04:22:44] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1003/25970/" [puppet] - https://gerrit.wikimedia.org/r/634840 (https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott)
[05:10:45] (PS1) Marostegui: site.pp: Clarifying comment about multi-instance hosts [puppet] - https://gerrit.wikimedia.org/r/634842
[05:12:08] (CR) Marostegui: [C: +2] site.pp: Clarifying comment about multi-instance hosts [puppet] - https://gerrit.wikimedia.org/r/634842 (owner: Marostegui)
[05:22:33] Operations, ops-codfw, DBA, Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) Just for the record, the rebuilt process for the new disk finished correctly: ` root@es2026:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual D...
[05:36:25] Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (Marostegui) Is this fully done?
[05:53:59] Operations, Wikimedia-Mailing-lists: Create a mailing list for frwiki nominators - https://phabricator.wikimedia.org/T265835 (Marostegui) a:Marostegui
[06:08:11] Operations, Wikimedia-Mailing-lists: Create a mailing list for frwiki nominators - https://phabricator.wikimedia.org/T265835 (Marostegui) This list has been created. You should've received an email with the autogenerated admin password. Subscription requires approval. Can you please add this mailing list...
[06:18:12] Operations, Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (Marostegui) @herron could you take a look at this?
[06:18:19] Operations, Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (Marostegui) p:Triage→Medium
[06:45:46] Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (MoritzMuehlenhoff) Open→Resolved For the followup work with the old home there's T264994, so we can close this.
[06:45:52] !log elukey@deploy1001 Started deploy [analytics/turnilo/deploy@334627e]: Upgrade to 1.27
[06:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:02] !log elukey@deploy1001 Finished deploy [analytics/turnilo/deploy@334627e]: Upgrade to 1.27 (duration: 00m 10s)
[06:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:58] Operations, CAS-SSO: Update CAS to 6.2 - https://phabricator.wikimedia.org/T265857 (MoritzMuehlenhoff)
[07:05:11] (CR) Volans: "looks good overall, minor nits inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/634572 (owner: Jbond)
[07:13:06] PROBLEM - Disk space on wdqs2002 is CRITICAL: DISK CRITICAL - free space: /srv 53420 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2002&var-datasource=codfw+prometheus/ops
[07:14:35] gehel: ^
[07:16:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2125 ', diff saved to https://phabricator.wikimedia.org/P13022 and previous config saved to /var/cache/conftool/dbconfig/20201019-071614-marostegui.json
[07:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:57] marostegui: thanks ! I'll have a look in a few
[07:21:27] gehel: good morning!
I checked and under srv there is a big wikidata.jnl (1.2T)
[07:22:30] Operations, netops, cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (ayounsi) p:Triage→Medium
[07:22:58] elukey: thanks ! That file is supposed to be around 660GB. But sometimes it grows for unknown reasons.
[07:23:04] I'll reset the data
[07:24:26] (CR) Ayounsi: "1 inline comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/634050 (owner: Arturo Borrero Gonzalez)
[07:24:30] (PS1) Elukey: Remove analytics1055 from the hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/634905 (https://phabricator.wikimedia.org/T255140)
[07:25:15] (CR) Elukey: [C: +2] Remove analytics1055 from the hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/634905 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[07:25:26] (PS1) Muehlenhoff: Bump to 6.2.4 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/634907 (https://phabricator.wikimedia.org/T265857)
[07:25:51] (PS1) Volans: cumin: fix aliases [puppet] - https://gerrit.wikimedia.org/r/634908 (https://phabricator.wikimedia.org/T259013)
[07:26:21] (CR) Volans: role::logstash::collector: remove unused role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617090 (https://phabricator.wikimedia.org/T259013) (owner: Jbond)
[07:26:37] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/634908 (https://phabricator.wikimedia.org/T259013) (owner: Volans)
[07:28:05] (CR) Volans: [C: +2] cumin: fix aliases [puppet] - https://gerrit.wikimedia.org/r/634908 (https://phabricator.wikimedia.org/T259013) (owner: Volans)
[07:28:34] thanks gehel!
[07:30:09] (PS1) Volans: cumin: fix aliases typo [puppet] - https://gerrit.wikimedia.org/r/634910
[07:32:16] (CR) Muehlenhoff: [C: +1] cumin: fix aliases typo [puppet] - https://gerrit.wikimedia.org/r/634910 (owner: Volans)
[07:32:33] (CR) Volans: [C: +2] cumin: fix aliases typo [puppet] - https://gerrit.wikimedia.org/r/634910 (owner: Volans)
[07:36:00] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[07:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:11] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99)
[07:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:29] (CR) Volans: [C: +1] "trivial enough :)" [software/cumin] - https://gerrit.wikimedia.org/r/634504 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:41:24] (CR) Ayounsi: "Didn't review the actual script as I think Riccardo did in another CR. Let me know if I should." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/634566 (owner: Jbond)
[07:42:14] (CR) Gehel: [C: +2] Mark get_short_command() as private. [software/cumin] - https://gerrit.wikimedia.org/r/634504 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:46:25] (PS1) Elukey: sre.hadoop.init-hadoop-workers: add option to wipe partition tables [cookbooks] - https://gerrit.wikimedia.org/r/634911
[07:49:56] (CR) ZPapierski: [C: -1] "This change is correct configuration-wise, but since it requires additional community sync, we are keeping this one on hold, unless requir" [puppet] - https://gerrit.wikimedia.org/r/615810 (owner: ZPapierski)
[07:50:33] (CR) Elukey: [C: -1] "This doesn't work, the partitions need to be unmounted first."
[cookbooks] - https://gerrit.wikimedia.org/r/634911 (owner: Elukey)
[07:51:14] Operations, netops, observability, Security, User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (ayounsi) Shameless plug T238414.
[07:53:00] Operations, Traffic, Performance-Team (Radar): Consider collecting more timestamp milestones from ATS-TLS - https://phabricator.wikimedia.org/T265869 (Gilles)
[07:53:55] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer
[07:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:59] (CR) Filippo Giunchedi: [C: +2] hieradata: re-enable compaction for prometheus[12]003 [puppet] - https://gerrit.wikimedia.org/r/633972 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[07:57:09] (PS4) Filippo Giunchedi: hieradata: re-enable compaction for prometheus[12]003 [puppet] - https://gerrit.wikimedia.org/r/633972 (https://phabricator.wikimedia.org/T261281)
[08:01:12] !log re-enable compaction for prometheus[12]003 - T261281
[08:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:19] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281
[08:02:28] there will be some alerts about prometheus restarted, expected
[08:02:32] (PS1) Ladsgroup: Add dns entry for zuul.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/634913 (https://phabricator.wikimedia.org/T207008)
[08:05:50] (PS1) Ladsgroup: mediawiki: Funnel zuul.wikimedia.org to integration.wikimedia.org/zuul [puppet] - https://gerrit.wikimedia.org/r/634914 (https://phabricator.wikimedia.org/T207008)
[08:07:03] PROBLEM - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s
[08:07:20] (PS3) Muehlenhoff: Make the mirror to use configurable via Hiera [puppet] - https://gerrit.wikimedia.org/r/634023 (https://phabricator.wikimedia.org/T262647)
[08:07:35] PROBLEM - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[08:08:29] PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[08:08:51] PROBLEM - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics
[08:09:35] PROBLEM - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[08:10:25] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[08:11:37] PROBLEM - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL:
instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s
[08:12:43] RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[08:13:11] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[08:13:17] PROBLEM - Prometheus prometheus1003/analytics restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics
[08:14:15] RECOVERY - Disk space on wdqs2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2002&var-datasource=codfw+prometheus/ops
[08:16:41] (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: Dzahn)
[08:17:07] Operations, Domains, Education-Program-Dashboard, Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (Ladsgroup) @bd808 Hey, I don't have access to [[https://wikitech.wikimedia.org/wiki/Nova_Resource:Redirects|redirects project]] in cloudVPS a...
[08:18:06] (CR) Jcrespo: "BTW, there is a role, different than insetup, for soon-to-be decom hosts, but with equivalent result. I don't remember the name now. We sh" [puppet] - https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: Dzahn)
[08:20:19] (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: Dzahn)
[08:24:12] !log imported helm 2.16.12-1 to buster-wikimedia stretch-wikimedia jessie-wikimedia - T263616
[08:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:43] (CR) Jcrespo: "Thank you!" [software] - https://gerrit.wikimedia.org/r/633053 (owner: Marostegui)
[08:26:01] !log updated helm to 2.16.12-1 on deploy2001
[08:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:41] (PS2) Filippo Giunchedi: profile: add alerts for Thanos sidecar not uploading or failing to do so [puppet] - https://gerrit.wikimedia.org/r/634475 (https://phabricator.wikimedia.org/T265632)
[08:27:07] (CR) Filippo Giunchedi: profile: add alerts for Thanos sidecar not uploading or failing to do so (1 comment) [puppet] - https://gerrit.wikimedia.org/r/634475 (https://phabricator.wikimedia.org/T265632) (owner: Filippo Giunchedi)
[08:28:02] Operations, Domains, Traffic, Patch-For-Review: wikiknihy.cz - transfer to Wikimedia Czech Republic? - https://phabricator.wikimedia.org/T127573 (Ladsgroup) Open→Declined So currently, https://wikiknihy.cz/ redirects to cs.wikibooks.org without any issues. I call this declined as it seems...
[08:30:00] Operations, Traffic, Sustainability (Incident Followup): upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517 (Joe) While those urls weren't my original report (which was about truncated URLs), it seems the behaviour has in the mean...
[08:30:10] (CR) Jcrespo: "Very nicely done! Thank you for working on this, and making the push to merge it, which I really wanted done. Problem is now we will expec" [puppet] - https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) (owner: Jcrespo)
[08:31:20] !log swift codfw-prod: bump object weight for ms-be2057 - T261633
[08:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:26] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[08:31:39] RECOVERY - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s
[08:32:09] RECOVERY - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics
[08:32:15] RECOVERY - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[08:33:47] RECOVERY - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s
[08:33:53] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[08:34:00] (CR) Muehlenhoff: [C: +2] Make the mirror to use configurable via Hiera [puppet] - https://gerrit.wikimedia.org/r/634023 (https://phabricator.wikimedia.org/T262647) (owner: Muehlenhoff)
[08:34:01] RECOVERY - Prometheus prometheus1003/analytics restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics
[08:34:01] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[08:34:39] RECOVERY - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[08:35:14] (PS2) Elukey: sre.hadoop.init-hadoop-workers: add option to wipe partition tables [cookbooks] - https://gerrit.wikimedia.org/r/634911
[08:37:10] !log upgrade rsyslog to 8.2008.0-1~bpo10+1 on centrallog2001 - T259780
[08:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:15] T259780: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780
[08:37:23] (CR) Volans: [C: -1] "Looks good in general, some nits inline and a suggestion to use a different approach in the CLI." (5 comments) [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[08:40:01] !log updated helm to 2.16.12-1 on deploy*,chartmuseum*,contint*
[08:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:10] (CR) Volans: "Alternative option inline, but I'm wondering if in general we should just reimage the hosts in those cases."
(1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/634911 (owner: Elukey)
[08:47:06] Operations, Traffic, Wikispore, HTTPS: Make Wikispore HTTPS-only - https://phabricator.wikimedia.org/T260701 (Ladsgroup)
[08:50:24] Operations, MW-on-K8s, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe)
[08:50:50] (PS1) Muehlenhoff: Switch sretest1001 to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/634921
[08:51:29] Operations, MW-on-K8s, observability, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) p:Triage→High
[08:51:43] Operations, MW-on-K8s, serviceops: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe) a:Joe
[08:55:01] (PS3) Elukey: sre.hadoop.init-hadoop-workers: add option to wipe partition tables [cookbooks] - https://gerrit.wikimedia.org/r/634911
[09:03:16] (CR) Muehlenhoff: [C: +2] Switch sretest1001 to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/634921 (owner: Muehlenhoff)
[09:03:20] (PS2) Muehlenhoff: Switch sretest1001 to deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/634921
[09:05:05] Operations, Traffic, Sustainability (Incident Followup): upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517 (Tgr) Yeah, as mentioned in T106517#1569938, that's intentional. A 400 means the URL couldn't be resolved into a valid fil...
[09:07:17] Operations, MW-on-K8s, observability, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) Additional datapoint that was required: we should be sending ~ 10/15k messages per second to the central log server, depending on traffic.
[09:08:50] Operations, netops: Spike of multicast traffic - https://phabricator.wikimedia.org/T212273 (ayounsi) Open→Declined No mo' multicast.
[09:09:58] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[09:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:37] Operations, Analytics, Traffic, netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (elukey) @CDanis I added the following config to turnilo on an-tool1005 (staging instance): ` measures: - name: bytes title: Bytes...
[09:11:10] (PS1) Filippo Giunchedi: aptrepo: add component/rsyslog [puppet] - https://gerrit.wikimedia.org/r/634922 (https://phabricator.wikimedia.org/T259780)
[09:11:27] PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 4229 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:14:28] (PS1) Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324)
[09:15:31] (CR) JMeybohm: "> Patch Set 4:" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/634354 (https://phabricator.wikimedia.org/T258572) (owner: Jeena Huneidi)
[09:19:07] (PS1) Ladsgroup: Add wikimedia.org.tr template pointing out to another NS [dns] - https://gerrit.wikimedia.org/r/634925 (https://phabricator.wikimedia.org/T259792)
[09:19:24] (PS1) Muehlenhoff: Revert "Switch sretest1001 to deb.debian.org" [puppet] - https://gerrit.wikimedia.org/r/634926
[09:19:28] (CR) jerkins-bot: [V: -1] Add wikimedia.org.tr template pointing out to another NS [dns] - https://gerrit.wikimedia.org/r/634925 (https://phabricator.wikimedia.org/T259792) (owner: Ladsgroup)
[09:20:10] Operations,
Continuous-Integration-Infrastructure, Doxygen, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar) Doxygen has been upgraded to 1.8.19 (T254465)
[09:20:52] Operations, Continuous-Integration-Config, Developer Productivity, Doxygen, and 3 others: Update Doxygen in CI to 1.8.17 or greater - https://phabricator.wikimedia.org/T242155 (hashar) Doxygen has been upgraded to 1.8.19 (T254465)
[09:21:04] Operations, Continuous-Integration-Infrastructure, Doxygen, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar)
[09:21:19] (PS2) Muehlenhoff: Revert "Switch sretest1001 to deb.debian.org" [puppet] - https://gerrit.wikimedia.org/r/634926
[09:21:22] Operations, Continuous-Integration-Infrastructure, Doxygen, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)): Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar)
[09:24:31] (CR) Ayounsi: [C: +1] sre.pdus.rotate-password: fix TypeError: 'tuple' object does not support item assignment [cookbooks] - https://gerrit.wikimedia.org/r/629074 (owner: Jbond)
[09:26:29] RECOVERY - WDQS high update lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1002 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:27:39] Operations, Analytics, Traffic, netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (faidon) Yay, that's awesome! You can't imagine how much time this would save! I changed the config a little bit. Specifically: * Bits per second is more...
[09:30:37] (PS1) Ladsgroup: Make pywikibot.org reach production ncredir [dns] - https://gerrit.wikimedia.org/r/634928 (https://phabricator.wikimedia.org/T257536)
[09:30:58] (CR) Muehlenhoff: [C: +2] Revert "Switch sretest1001 to deb.debian.org" [puppet] - https://gerrit.wikimedia.org/r/634926 (owner: Muehlenhoff)
[09:35:16] (CR) Ladsgroup: "I'm not sure what I'm doing here is correct. Please take a look." [dns] - https://gerrit.wikimedia.org/r/634928 (https://phabricator.wikimedia.org/T257536) (owner: Ladsgroup)
[09:37:07] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:39] Operations, serviceops, Kubernetes, Service-Architecture: Consider using a file-based xDS system for envoy in k8s - https://phabricator.wikimedia.org/T265879 (Joe)
[09:39:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:49] Operations, serviceops, Kubernetes, Service-Architecture: Upgrade envoy configuration to use the v3 API - https://phabricator.wikimedia.org/T265880 (Joe)
[09:41:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:41:20] (CR) Ayounsi: [C: -1] bird: ensure bird service is running [puppet] - https://gerrit.wikimedia.org/r/625926 (owner: Jbond)
[09:43:05] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:19] (Abandoned) Ayounsi: Smokeping: remove asw- mgmt probing [puppet] - https://gerrit.wikimedia.org/r/419966 (owner: Ayounsi)
[09:47:21] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:08] Operations, serviceops, Kubernetes, Service-Architecture: Improve envoy configuration CI checks - https://phabricator.wikimedia.org/T265881 (Joe)
[09:51:16] Operations, serviceops, Kubernetes, Service-Architecture: Allow canarying new envoy configurations in kubernetes - https://phabricator.wikimedia.org/T265882 (Joe)
[09:53:59] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/634922 (https://phabricator.wikimedia.org/T259780) (owner: Filippo Giunchedi)
[09:55:38] Operations: Provide failover capacity for package installations from main mirror - https://phabricator.wikimedia.org/T262647 (MoritzMuehlenhoff) Open→Resolved This is complete: The package mirror can now be set via profile::base::mirror_server (and still defaults to mirrors.wikimedia.org)
[09:58:03] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:00:25] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:02:51] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:03:37] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:46] (03PS4) 10Jbond: diffscan: pyhotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572 [10:06:06] (03PS1) 10Elukey: turnilo: add bps/pps measure to the wmf_netflow datasource [puppet] - 10https://gerrit.wikimedia.org/r/634931 (https://phabricator.wikimedia.org/T263290) [10:06:56] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: add component/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/634922 (https://phabricator.wikimedia.org/T259780) (owner: 10Filippo Giunchedi) [10:07:23] (03CR) 10Ayounsi: [C: 03+1] turnilo: add bps/pps measure to the wmf_netflow datasource [puppet] - 10https://gerrit.wikimedia.org/r/634931 (https://phabricator.wikimedia.org/T263290) (owner: 10Elukey) [10:08:14] (03CR) 10Elukey: [C: 03+2] turnilo: add bps/pps measure to the wmf_netflow datasource [puppet] - 10https://gerrit.wikimedia.org/r/634931 (https://phabricator.wikimedia.org/T263290) (owner: 10Elukey) [10:08:45] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:09] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:23] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:11:46] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10elukey) a:03elukey [10:15:17] the ms-be errors in codfw is me btw, rebalancing [10:15:56] how dare you godog [10:15:59] :D [10:16:36] haha! I know right elukey ? 
unbelievable [10:17:11] yes yes for sure [10:18:27] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:29] PROBLEM - Docker registry HTTPS interface on registry2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [10:22:39] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:46] 10Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750 (10MoritzMuehlenhoff) 05Open→03Declined We can close this task given that the OpenLDAP mirror in going away in favour of JumpCloud [10:23:59] RECOVERY - Docker registry HTTPS interface on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Docker [10:24:25] PROBLEM - SSH on ms-be2029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:25:17] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:25:57] RECOVERY - SSH on ms-be2029 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:30:04] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T1030) [10:32:49] PROBLEM - SSH on ms-be2029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:33:29] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:59] 10Operations, 10CAS-SSO, 10Patch-For-Review: Update CAS to 6.2 - https://phabricator.wikimedia.org/T265857 (10Marostegui) p:05Triage→03Medium [10:35:00] 10Operations, 10serviceops, 10Kubernetes, 10Service-Architecture: Allow canarying new envoy configurations in kubernetes - https://phabricator.wikimedia.org/T265882 (10Marostegui) p:05Triage→03Medium [10:35:05] PROBLEM - SSH on ms-be2037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:35:07] 10Operations, 10serviceops, 10Kubernetes, 10Service-Architecture: Improve envoy configuration CI checks - https://phabricator.wikimedia.org/T265881 (10Marostegui) p:05Triage→03Medium [10:35:13] 10Operations, 10serviceops, 10Kubernetes, 10Service-Architecture: Upgrade envoy configuration to use the v3 API - https://phabricator.wikimedia.org/T265880 (10Marostegui) p:05Triage→03Medium [10:35:19] 10Operations, 10serviceops, 10Kubernetes, 10Service-Architecture: Consider using a file-based xDS system for envoy in k8s - https://phabricator.wikimedia.org/T265879 (10Marostegui) p:05Triage→03Medium [10:36:30] (03PS1) 10Giuseppe Lavagetto: Switch cxserver to use the envoy service proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634933 [10:36:32] (03PS1) 10Giuseppe Lavagetto: Switch restbase calls to be channeled via envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634934 [10:36:34] (03PS1) 10Giuseppe Lavagetto: Remove references to the 'electron' service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 [10:36:37] (03PS1) 10Giuseppe 
Lavagetto: service_proxy: add cxserver to the default configuration [puppet] - 10https://gerrit.wikimedia.org/r/634936 [10:37:39] RECOVERY - SSH on ms-be2029 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:37:56] (03PS1) 10Ladsgroup: Add _ to the allowed list of short url characters [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) [10:38:19] RECOVERY - SSH on ms-be2037 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:38:37] 10Operations, 10EasyTimeline, 10Packaging: WMF deployed EasyTimeline extension depends on Ploticus package which is not available in Debian Buster (but available again in Debian Bullseye) - https://phabricator.wikimedia.org/T253377 (10Marostegui) p:05Triage→03Medium [10:40:49] (03CR) 10Urbanecm: [C: 03+2] "docs-only, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634772 (owner: 10Urbanecm) [10:41:32] (03Merged) 10jenkins-bot: arbcom_ruwiki.yaml: Fix a copy/paste typo from creating the wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634772 (owner: 10Urbanecm) [10:43:13] (03CR) 10Lucas Werkmeister (WMDE): Add _ to the allowed list of short url characters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [10:43:57] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review+2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634543 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [10:47:09] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:45] !log [urbanecm@mwmaint2001 ~/updateVarDumps/script]$ mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=jawikivoyage --print-orphaned-records-to=- --progress-markers # 
T246539 [10:53:46] 10Operations, 10Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (10Marostegui) 05Open→03Resolved I am going to close this as fixed as most of the common stuff is at https://wikitech.wikimedia.org/wiki/Management_Interfaces as pointed above. [10:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:52] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [10:54:30] (03PS5) 10Milimetric: analytics_cluster/turnilo: Configure url shortner [puppet] - 10https://gerrit.wikimedia.org/r/622600 (https://phabricator.wikimedia.org/T233336) [10:55:11] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:34] 10Operations, 10Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (10Peachey88) @Orlodrim If you know how to check your mail headers, If you look at the one that is sent to you instead of the mailing list that adds [SPAM], Can you have a lo... 
[10:57:10] 10Operations, 10Traffic, 10observability: prometheus-varnish-exporter@frontend.service: Unit entered failed state - invalid character 'C' - https://phabricator.wikimedia.org/T203191 (10Marostegui) [10:57:23] !log Start `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=enwikisource --print-orphaned-records-to=/tmp/urbanecm/enwikisource-orphaned.log --progress-markers` in a tmux session named updateVarDumps at mwmaint2001 (T246539) [10:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:56] 10Operations, 10User-notice: 2018 data center switchover: Move all the things over to codfw - https://phabricator.wikimedia.org/T200022 (10Marostegui) 05Open→03Resolved The switchover was done in 2018: T199073 Closing this! [10:58:59] 10Operations, 10CommRel-Specialists-Support (Jan-Mar-2019), 10Goal, 10User-Johan: Community Relations support for the 2018 data center switchover - https://phabricator.wikimedia.org/T199676 (10Marostegui) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T1100). [11:00:04] kostajh and Urbanecm: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] I can deploy today! [11:00:18] hi Urbanecm [11:00:21] hi kostajh [11:00:57] Urbanecm: do you want me to make the change discussed in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/633514/5#message-3607877122a852a9f0883d3b7f8d841e581c9209 to split this into three patches instead of 2? 
[11:01:01] (03PS6) 10Urbanecm: labs: Disable EditorJourney (UnderstandingFirstDay) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:01:09] kostajh: no problem, I'll merge it as-is [11:01:16] thanks [11:01:22] (03CR) 10Urbanecm: [C: 03+2] labs: Disable EditorJourney (UnderstandingFirstDay) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:01:29] (03PS3) 10Urbanecm: Disable EditorJourney (UnderstandingFirstDay) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634012 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:01:34] (03CR) 10Urbanecm: [C: 03+2] Disable EditorJourney (UnderstandingFirstDay) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634012 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:02:37] (03Merged) 10jenkins-bot: labs: Disable EditorJourney (UnderstandingFirstDay) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:02:41] (03Merged) 10jenkins-bot: Disable EditorJourney (UnderstandingFirstDay) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634012 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:03:37] kostajh: can this be tested at mwdebug? [11:03:47] PROBLEM - SSH on ms-be2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:03:48] (I mean, for prod, not beta) [11:03:59] Urbanecm: yes, it can [11:04:04] 10Operations: systemd-logind fails with result 'timeout' in db2093 and dns4001 - https://phabricator.wikimedia.org/T198215 (10Marostegui) 05Open→03Resolved This is no longer happening [11:04:16] kostajh: okay. 
I pulled it onto mwdebug2001, please let me know if it works :) [11:04:30] Urbanecm: thanks, looking [11:04:36] (I'll need a bit of time) [11:04:40] (03PS4) 10Urbanecm: Restore bureaucrat's abilities at uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634557 (https://phabricator.wikimedia.org/T265746) [11:04:44] (03CR) 10Urbanecm: [C: 03+2] Restore bureaucrat's abilities at uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634557 (https://phabricator.wikimedia.org/T265746) (owner: 10Urbanecm) [11:04:50] kostajh: sure, take your time [11:05:34] (03Merged) 10jenkins-bot: Restore bureaucrat's abilities at uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634557 (https://phabricator.wikimedia.org/T265746) (owner: 10Urbanecm) [11:06:30] (03CR) 10Urbanecm: labs: Disable EditorJourney (UnderstandingFirstDay) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [11:07:25] PROBLEM - SSH on ms-be2036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:08:20] (03PS1) 10Lucas Werkmeister (WMDE): Remove noratelimit from Wikidata bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634938 (https://phabricator.wikimedia.org/T258354) [11:08:51] (03PS6) 10Milimetric: analytics_cluster/turnilo: Configure url shortner [puppet] - 10https://gerrit.wikimedia.org/r/622600 (https://phabricator.wikimedia.org/T233336) [11:08:55] RECOVERY - SSH on ms-be2036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:09:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "To be deployed tomorrow, 2020-10-20." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/634938 (https://phabricator.wikimedia.org/T258354) (owner: 10Lucas Werkmeister (WMDE)) [11:10:15] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:21] RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:10:50] 10Operations, 10Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (10Orlodrim) When I send an e-mail to myself without going through the mailing list server, it get the following spam tags: X-Ovh-Tracer-Id: 8141945179226048193 X-Ovh-Remote... [11:10:51] ACKNOWLEDGEMENT - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/f17bd20f2acde2f8f40c9a6364472e317d93b16b0206be825278eb3558751c83/merged is not accessible: Permission denied Marostegui testing hosts https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [11:10:51] ACKNOWLEDGEMENT - Apache HTTP on testvm1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui testing hosts https://wikitech.wikimedia.org/wiki/Application_servers [11:10:51] ACKNOWLEDGEMENT - mediawiki-installation DSH group on testvm1001 is CRITICAL: Host testvm1001 is not in mediawiki-installation dsh group Marostegui testing hosts https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:13:12] !log Manually run `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` for several small group2 wikis (T246539) [11:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:19] T246539: Dry-run, then 
actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [11:15:30] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10Marostegui) 05Open→03Resolved a:05herron→03None Resolving this as this host is ok now, after a year! :) [11:15:45] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10Marostegui) a:03herron [11:16:30] Urbanecm: I verified that events no longer flow to EditorJourney topic when traffic is routed through mwdebug2001, and that we do continue to oversample events for EditAttemptStep. [11:16:39] Urbanecm: I'm looking at logstash https://logstash.wikimedia.org/goto/a5847bd62e47d1dd3b4d2aa4004b485d and don't see anything problematic [11:16:45] me neither [11:16:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "> * https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/4559/console : SUCCESS Please carefully" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634938 (https://phabricator.wikimedia.org/T258354) (owner: 10Lucas Werkmeister (WMDE)) [11:16:59] kostajh: so, ready to sync, I guess? [11:17:00] Urbanecm: I assume the "Persisting session for unknown reason" is not relevant? [11:17:28] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:17:29] doesn't seem so [11:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:13] OK, I think it is good to sync then [11:18:22] syncing [11:19:28] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) @jijiki EditorJourney logging is now switched off. We may at some point want to re-enable but will wait for this wor... 
[11:20:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 26b97261f2b9d1991ea08fe32b6007ba6fe5088f: Disable EditorJourney (UnderstandingFirstDay) (T252391) (duration: 01m 10s) [11:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:07] kostajh: done :) [11:20:08] T252391: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 [11:20:16] Urbanecm: thank you! [11:20:38] np [11:21:01] RECOVERY - Disk space on sretest1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [11:23:41] PROBLEM - Check systemd state on ms-be2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ce92c9814bf9c12cab1a9592dfb32f935d255d93: Restore bureaucrat abilities at uzwiki (T265746) (duration: 00m 56s) [11:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:25] T265746: Uzwiki bureaucrats lost their special abilities after deploying T265509 - https://phabricator.wikimedia.org/T265746 [11:24:37] !log EU B&C window done [11:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:50] 10Operations: long-running root console sessions - https://phabricator.wikimedia.org/T105869 (10Marostegui) 05Open→03Resolved Monitoring is now in place on Icinga. Closing [11:26:05] 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) [11:31:01] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [11:31:05] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:31] Good morning team! [11:35:57] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for frwiki nominators - https://phabricator.wikimedia.org/T265835 (10Kvardek_du) [11:37:05] RECOVERY - Check systemd state on ms-be2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:41] !log [urbanecm@mwmaint2001 ~/updateVarDumps/script]$ while read wiki; do echo "Processing $wiki"; mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log; done < ../small-group2.dblist # T246539 # small-group2.dblist is wikis from small.dblist that are also in group2.dblist [11:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:47] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [11:41:35] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:00] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=enwikisource --print-orphaned-records-to=/tmp/urbanecm/enwikisource-orphaned.log --progress-markers` (T246539) [11:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:55] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10akosiaris) Any news on this one? 
(just found out today about it while working on T265607) [11:43:34] !log End of `[urbanecm@mwmaint2001 ~/updateVarDumps/script]$ while read wiki; do echo "Processing $wiki"; mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log; done < ../small-group2.dblist` # T246539 # small-group2.dblist is wikis from small.dblist that are also in group2.dblist [11:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:09] Does anyone have an idea about what is breaking in T263617? [11:44:10] T263617: Cannot login to beta cluster: "There seems to be a problem with your login session..." - https://phabricator.wikimedia.org/T263617 [11:46:30] !log updating idp-test2001 to CAS 6.2.4 [11:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:38] (03PS1) 10Faidon Liambotis: turnilo: add bytes formatting [puppet] - 10https://gerrit.wikimedia.org/r/634941 [11:47:41] elukey: ^ [11:49:19] PROBLEM - SSH on ms-be2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:51:08] !log updating idp-test1001 to CAS 6.2.4 [11:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:04] 10Operations, 10netops: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 (10ayounsi) Also https://kb.juniper.net/InfoCenter/index?page=content&id=JSA11086 [11:54:43] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump to 6.2.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/634907 (https://phabricator.wikimedia.org/T265857) (owner: 10Muehlenhoff) [11:55:55] RECOVERY - SSH on ms-be2023 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:56:53] (03CR) 10Lucas Werkmeister (WMDE): Add _ to the allowed list of short url characters (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [11:58:27] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:04] Urbanecm and Amir1: That opportune time is upon us again. Time for a Create smnwiki deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T1200). [12:00:15] heh, don't be afraid :) [12:00:57] Amir1: /me waves :) [12:02:18] sorry [12:02:21] around now [12:02:24] good :) [12:02:24] forgot [12:02:26] I'm starting then [12:03:08] (03PS4) 10Muehlenhoff: Install ldap-replica200[34] as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388) [12:03:18] (03PS2) 10Urbanecm: Initial configuration for smnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634774 (https://phabricator.wikimedia.org/T264859) [12:03:23] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for smnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634774 (https://phabricator.wikimedia.org/T264859) (owner: 10Urbanecm) [12:04:33] (03Merged) 10jenkins-bot: Initial configuration for smnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634774 (https://phabricator.wikimedia.org/T264859) (owner: 10Urbanecm) [12:07:18] pulled onto mwmaint2001 [12:07:47] running the magical command [12:08:02] pulling onto mwdebug2001 [12:09:24] db is live, syncing [12:09:40] (03PS1) 10Faidon Liambotis: turnilo: add exporter (router) hostname + region [puppet] - 10https://gerrit.wikimedia.org/r/634946 [12:10:01] (03PS2) 10Faidon Liambotis: turnilo: add exporter hostname and region for netflow [puppet] - 10https://gerrit.wikimedia.org/r/634946 [12:10:28] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating smnwiki (T264859) (duration: 00m 56s) [12:10:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:35] T264859: Create Inari Sámi Wikipedia - https://phabricator.wikimedia.org/T264859 [12:11:06] (03PS3) 10Faidon Liambotis: turnilo: add exporter hostname and region for netflow [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) [12:11:32] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating smnwiki (T264859) (duration: 00m 55s) [12:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:37] !log urbanecm@deploy1001 Synchronized dblists: Creating smnwiki (T264859) (duration: 00m 55s) [12:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:29] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:13] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating smnwiki (T264859) [12:14:15] (03PS1) 10Faidon Liambotis: turnilo: fix retainMissingValue misconfig [puppet] - 10https://gerrit.wikimedia.org/r/634948 [12:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:39] wiki works, so proceeding with the rest of syncs [12:15:02] !log Deploy schema change on smnwiki T265321 T264900 [12:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:09] T265321: ipblocks_restrictions.ir_type is tinyint(1) in code but tinyint(4) in production - https://phabricator.wikimedia.org/T265321 [12:15:09] T264900: Prepare and check storage layer for smnwiki - https://phabricator.wikimedia.org/T264900 [12:15:45] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating smnwiki (T264859) (duration: 00m 56s) [12:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:51] T264859: Create Inari Sámi Wikipedia - 
https://phabricator.wikimedia.org/T264859 [12:16:09] !log Sanitize smnwiki on db1124:3315 and db2094:3315 - T264900 [12:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating smnwiki (T264859) (duration: 00m 55s) [12:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:14] Amir1: hmm, apparently, I forgot to add the language to https://github.com/wikimedia/operations-mediawiki-config/blob/master/langlist. Should I just push&merge a follow-up, and sync it? Or did I break something unintentionally? [12:17:27] (wiki seems to be running, so it's at least not immediately visible) [12:17:44] Urbanecm: that causes issues for the interwiki cache [12:18:01] so add it, then update the interwiki cache [12:18:05] Amir1: so, since I didn't update interwiki cache yet, just fixing the mistake will help? [12:18:06] good [12:18:10] yup [12:19:17] thanks [12:19:19] (03PS1) 10Urbanecm: Add smn to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634949 (https://phabricator.wikimedia.org/T264859) [12:19:36] (03CR) 10Urbanecm: [C: 03+2] "part of wiki-creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634949 (https://phabricator.wikimedia.org/T264859) (owner: 10Urbanecm) [12:20:55] (03Merged) 10jenkins-bot: Add smn to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634949 (https://phabricator.wikimedia.org/T264859) (owner: 10Urbanecm) [12:21:19] syncing it [12:22:12] !log urbanecm@deploy1001 Synchronized langlist: Creating smnwiki (T264859) (duration: 00m 56s) [12:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] T264859: Create Inari Sámi Wikipedia - https://phabricator.wikimedia.org/T264859 [12:23:35] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634950 [12:23:37] (03CR) 10Urbanecm: [C: 03+2] Update 
interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634950 (owner: 10Urbanecm) [12:23:38] updating cache now [12:24:19] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634950 (owner: 10Urbanecm) [12:25:44] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 00m 56s) [12:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:56] Amir1: so, seems we're done? :) [12:26:07] \o/ [12:26:07] (03CR) 10Hashar: [C: 03+1] "Forgot to +1" [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [12:26:15] Urbanecm: Like always, thank you so much! [12:26:28] !log Creation of smnwiki is done (T264859) [12:26:33] happy to help :) [12:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:29:43] (03CR) 10Ayounsi: "I think that's redundant with the work being done in I6a24d41e125718e1bea4711d8f7b9d126ef38969" [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) (owner: 10Faidon Liambotis) [12:31:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:43:44] (03PS1) 10Marostegui: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/634952 [12:44:22] (03CR) 10Marostegui: [C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/634952 (owner: 10Marostegui) [12:46:47] (03CR) 10Elukey: [C: 03+2] "Tested in staging, works fine! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634941 (owner: 10Faidon Liambotis) [12:47:41] (03PS1) 10Kormat: mariadb: Use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/634953 (https://phabricator.wikimedia.org/T256972) [12:47:43] (03PS1) 10JMeybohm: Enable atomic helm upgrades for admin deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/634954 (https://phabricator.wikimedia.org/T252428) [12:48:21] paravoid: deployed, thanks! [12:48:38] there are a couple more your way :P [12:48:43] !log installing httpcomponents-client security updates on Buster [12:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:20] (03PS2) 10JMeybohm: Enable atomic helm upgrades for admin deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/634954 (https://phabricator.wikimedia.org/T252428) [12:51:18] (03PS1) 10KartikMistry: WIP: Remove wgContentTranslationRESTBase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634956 [12:51:50] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10whym) Is the problem larger than the current title and description of this task suggest? The title of this... [12:52:37] (03CR) 10Kormat: "PCC is clean: https://puppet-compiler.wmflabs.org/compiler1003/25973/" [puppet] - 10https://gerrit.wikimedia.org/r/634953 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [12:54:48] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Ladsgroup) >>! In T261031#6560192, @whym wrote: > Is the problem larger than the current title and descript... 
[12:58:55] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:11] (03CR) 10Muehlenhoff: [C: 03+2] Install ldap-replica200[34] as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/632648 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [13:15:23] 10Operations, 10DBA, 10User-Kormat: Convert role::mariadb::misc to profile - https://phabricator.wikimedia.org/T265900 (10Kormat) [13:21:28] 10Operations, 10DBA, 10User-Kormat: Clean up role::mariadb::ferm and profile::mariadb::ferm - https://phabricator.wikimedia.org/T265901 (10Kormat) [13:26:22] !log import prometheus-openldap-exporter 0+git20171128-2+deb10u1 for buster-wikimedia T264388 [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:29] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [13:26:46] (03PS1) 10Kormat: mariadb: Convert role::mariadb::misc to profile. [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) [13:31:01] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:18] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:26] (03CR) 10Kormat: "PCC contains only changes to the MOTD: https://puppet-compiler.wmflabs.org/compiler1002/25974/" [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) (owner: 10Kormat) [13:33:57] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:31] !log Start of `[urbanecm@mwmaint2001 ~/updateVarDumps/output/group2-medium]$ while read wiki; do echo "Processing $wiki"; mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > output/$wiki.log; done < wikis.dblist` (T246539; wikis.dblist is medium wikis from group2.dblist) [13:34:33] 10Operations, 10DBA, 10Data-Persistence, 10User-Kormat: Clean up role::mariadb::ferm and profile::mariadb::ferm - https://phabricator.wikimedia.org/T265901 (10Marostegui) p:05Triage→03Medium [13:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:37] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [13:34:37] Daimona: ^^ [13:37:23] Noice :) [13:40:34] (03PS1) 10Muehlenhoff: acmechief: Also allow ldap-replica2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/634974 (https://phabricator.wikimedia.org/T264388) [13:42:04] (03CR) 10Klausman: [C: 03+1] sre.hadoop.init-hadoop-workers: add option to wipe partition tables [cookbooks] - 10https://gerrit.wikimedia.org/r/634911 (owner: 10Elukey) [13:46:02] (03PS1) 10Ottomata: helmfile.d: refactor eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) [13:48:22] PROBLEM - Check systemd state on ldap-replica2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:47] (03PS2) 10Ottomata: helmfile.d: refactor eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) [13:49:36] ACKNOWLEDGEMENT - Check systemd state on ldap-replica2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Muehlenhoff In setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:39] (03PS3) 10Ottomata: helmfile.d: refactor eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) [13:51:41] (03PS1) 10JMeybohm: Review access change [software/heptiolabs/eventrouter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/634801 [13:51:56] (03Abandoned) 10JMeybohm: Review access change [software/heptiolabs/eventrouter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/634801 (owner: 10JMeybohm) [13:55:45] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Convert role::mariadb::misc to profile - https://phabricator.wikimedia.org/T265900 (10LSobanski) p:05Triage→03Medium [14:03:10] (03PS1) 10Ottomata: helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) [14:06:00] (03CR) 10jerkins-bot: [V: 04-1] helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [14:08:37] (03PS1) 10Ottomata: helmfile.d: refactor eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/634984 (https://phabricator.wikimedia.org/T258572) [14:08:54] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 663 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:09:37] (03PS2) 10Ottomata: helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) [14:09:48] (03CR) 10Marostegui: [C: 03+1] mariadb: Use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/634953 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [14:10:30] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:12:32] (03CR) 10jerkins-bot: [V: 04-1] helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [14:12:54] (03PS3) 10Paladox: gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 [14:13:44] Is now a good time to deploy some little config changes related to api.mediawiki.org? 
[14:13:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable atomic helm upgrades for admin deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/634954 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [14:15:12] (03PS4) 10Cicalese: apiportal: enable discussion tools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [14:15:22] (03CR) 10Kormat: [C: 03+2] mariadb: Use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/634953 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [14:15:40] !log installing llvm-toolchain-7 bugfix updates from Buster point release [14:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:13] (03CR) 10BPirkle: [C: 03+1] "Approved for self merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [14:16:59] (03PS2) 10Kormat: mariadb: Convert role::mariadb::misc to profile. 
[puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) [14:17:45] (03CR) 10Cicalese: [C: 03+2] apiportal: enable discussion tools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [14:18:05] (03PS1) 10JMeybohm: Initial commit of eventrouter docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634985 (https://phabricator.wikimedia.org/T262675) [14:18:55] (03Merged) 10jenkins-bot: apiportal: enable discussion tools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [14:21:49] (03CR) 10Andrew Bogott: [C: 03+2] glance-api: make active/active in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/634840 (https://phabricator.wikimedia.org/T263461) (owner: 10Andrew Bogott) [14:22:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:52] (03PS3) 10Ottomata: helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) [14:24:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:58] (03PS7) 10Cicalese: Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) [14:25:08] (03PS8) 10Cicalese: Configuration for user menu and sidebar special pages. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) [14:25:52] (03CR) 10BPirkle: [C: 03+1] "Approved for self merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [14:26:33] (03CR) 10Cicalese: [C: 03+2] Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [14:27:01] (03PS2) 10Cicalese: Add api.wikimedia.org to the list of allowed CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [14:27:20] (03Merged) 10jenkins-bot: Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [14:28:19] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [14:29:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove restrouter [labs/private] - 10https://gerrit.wikimedia.org/r/632709 (owner: 10Alexandros Kosiaris) [14:29:08] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove restrouter [labs/private] - 10https://gerrit.wikimedia.org/r/632709 (owner: 10Alexandros Kosiaris) [14:30:44] (03CR) 10Ema: [C: 04-1] varnish: check for debug=1 value in X-Analytics header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [14:30:56] !log bpirkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:634356 Configuration for user menu and sidebar special pages (duration: 00m 56s) [14:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:14] !log bpirkle@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: 
gerrit:634356 Configuration for user menu and sidebar special pages (duration: 00m 55s) [14:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:28] (03CR) 10BPirkle: [C: 03+1] "Approved for self merge and deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [14:32:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:32:58] (03CR) 10Cicalese: [C: 03+2] Add api.wikimedia.org to the list of allowed CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [14:33:21] (03PS4) 10Ottomata: eventlogging-processor - skip events for schemas migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) [14:33:41] (03Merged) 10jenkins-bot: Add api.wikimedia.org to the list of allowed CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [14:34:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:35:18] (03CR) 10Ottomata: [C: 03+2] eventlogging-processor - skip events for schemas migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) (owner: 10Ottomata) [14:36:12] (03CR) 10JMeybohm: [C: 04-1] helmfile.d: refactor eventgate-logging-external (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [14:36:28] !log bpirkle@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:634841 Add api.wikimedia.org to the list of allowed CORS origins (duration: 00m 57s) [14:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] we're all done deploying api.wikimedia.org config changes [14:37:43] 10Operations, 10Traffic: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10Reedy) [14:40:58] (03PS1) 10DCausse: [cirrus] cleanup mediasearch commons A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634991 [14:41:00] (03PS1) 10DCausse: [cirrus] flip activation of MLR rescore window using supported_syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634992 [14:42:04] (03PS4) 10Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [14:44:14] (03PS5) 10Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [14:44:51] (03CR) 10Effie Mouzeli: varnish: check for debug=1 value in X-Analytics header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [14:44:52] 10Operations, 10Traffic: ATS trying to set 
socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 (10ema) [14:53:14] (03PS4) 10Ottomata: helmfile.d: refactor eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) [14:53:23] (03CR) 10Ottomata: helmfile.d: refactor eventgate-logging-external (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [14:59:51] (03PS4) 10Ottomata: helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) [15:04:13] (03PS1) 10Bartosz Dziewoński: Fix mobile diff redirect when 'curid' query parameter is present [extensions/MobileFrontend] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634802 (https://phabricator.wikimedia.org/T265654) [15:04:42] (03CR) 10Ammarpad: [C: 03+1] gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [15:06:05] (03CR) 10Huji: [C: 03+1] Change votewiki language temporarily to fa for fawiki elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [15:07:25] (03PS2) 10Ottomata: helmfile.d: refactor eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/634984 (https://phabricator.wikimedia.org/T258572) [15:07:46] (03PS5) 10Ottomata: helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) [15:07:48] 10Operations, 10Traffic: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 (10Marostegui) p:05Triage→03Medium [15:08:07] (03PS5) 10Ottomata: helmfile.d: refactor eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) [15:11:56] 
10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) [15:14:36] (03PS5) 10Pablo Grass (WMDE): Set Wikidata MF to collapse sections by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634039 (https://phabricator.wikimedia.org/T239195) (owner: 10Itamar Givon) [15:15:19] (03CR) 10Pablo Grass (WMDE): [C: 03+1] Set Wikidata MF to collapse sections by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634039 (https://phabricator.wikimedia.org/T239195) (owner: 10Itamar Givon) [15:15:45] (03PS4) 10Hashar: gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [15:16:32] (03CR) 10Hashar: [C: 03+1] "I have updated the commit message again to use the sha1 of the Zuul config change ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1" [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [15:19:26] (03PS1) 10Andrew Bogott: wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) [15:20:25] (03CR) 10jerkins-bot: [V: 04-1] wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [15:21:36] 10Operations, 10Puppet, 10observability, 10User-fgiunchedi, 10User-jbond: PuppetDB grafana graphs not matching logs - https://phabricator.wikimedia.org/T265649 (10fgiunchedi) [15:21:38] (03PS2) 10Andrew Bogott: wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) [15:22:36] (03CR) 10jerkins-bot: [V: 04-1] wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [15:24:25] (03PS3) 10Andrew Bogott: wmcs: add backup jobs 
for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) [15:25:26] (03CR) 10jerkins-bot: [V: 04-1] wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [15:26:35] 10Operations, 10ops-eqiad: an-scheduler1001 renamed to an-coord1002 - Update Host labelling and Switch ports - https://phabricator.wikimedia.org/T265639 (10Cmjohnson) 05Open→03Resolved switch port has been changed, it is in the analytics vlan, server label updated. [15:26:39] 10Operations, 10Analytics-Clusters: Rename an-scheduler1001 to an-coord1002 - https://phabricator.wikimedia.org/T265620 (10Cmjohnson) [15:27:41] (03PS1) 10Elukey: Assing role::analytics_cluster::coordinator::query to an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) [15:27:47] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) On site visit set for the 21St [15:29:34] (03PS4) 10Andrew Bogott: wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) [15:31:36] !log update puppet compilers' facts [15:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:10] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: add backup jobs for glance images on cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/634997 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [15:34:23] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10lmata) moving to radar but probably will close eventually as the Gitlab move progresses [15:35:32] 
10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) [15:37:11] (03CR) 10Ryan Kemper: [C: 03+2] Bump shard_size warning/crit thresholds [puppet] - 10https://gerrit.wikimedia.org/r/634391 (owner: 10Ryan Kemper) [15:38:44] 10Operations, 10Domains, 10Education-Program-Dashboard, 10Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (10bd808) >>! In T146332#6559215, @Ladsgroup wrote: > @bd808 Hey, I don't have access to [[https://wikitech.wikimedia.org/wiki/Nova_Resource:Red... [15:38:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265916 (10RobH) [15:41:30] 10Operations, 10ops-codfw: codfw: relocate thanos-fe2003 to create space for new ms-be servers - https://phabricator.wikimedia.org/T265647 (10Papaul) [15:42:42] (03PS1) 10Andrew Bogott: wmcs: glance image backup bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/635004 (https://phabricator.wikimedia.org/T265843) [15:42:46] 10Operations, 10ops-codfw: codfw: relocate thanos-fe2003 to create space for new ms-be servers - https://phabricator.wikimedia.org/T265647 (10Papaul) 05Open→03Resolved @fgiunchedi we just decom cp2021 which was in rack D2 on U3 so no need to move thanos-fe2003 i can use U2 and U3 to rack the new ms-be se... 
[15:43:22] 10Operations, 10ops-codfw, 10decommission-hardware: decommission cp2003, cp2009, cp2015, cp2021 - https://phabricator.wikimedia.org/T265729 (10Papaul) [15:43:56] 10Operations, 10ops-codfw, 10decommission-hardware: decommission cp2003, cp2009, cp2015, cp2021 - https://phabricator.wikimedia.org/T265729 (10Papaul) 05Open→03Resolved complete [15:46:51] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: glance image backup bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/635004 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [15:50:10] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) So we got some movement on this Friday/replies today. Dell Singapore is being very difficult and require a local contact number. I've gone ahead and... [15:51:58] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Maryana Pinchuk - https://phabricator.wikimedia.org/T265555 (10Maryana) {F32399182} I'm getting this error message when I try to log in to superset.wikimedia.org – is this expected? [15:54:33] (03PS2) 10Elukey: Assing role::analytics_cluster::coordinator::query to an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) [15:55:38] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [15:58:04] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Maryana Pinchuk - https://phabricator.wikimedia.org/T265555 (10Urbanecm) @Maryana It seems you are trying to log in using the wrong username (hence the access denied message). Can you please use Maryana instead? According to https://ldap.to... 
[15:58:55] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Maryana Pinchuk - https://phabricator.wikimedia.org/T265555 (10Urbanecm) (note the maryana account is also a member of analytics project at Cloud VPS, while the MPinchuk is completely unprivileged) [15:58:58] (03PS3) 10Elukey: Assing role::analytics_cluster::coordinator::query to an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) [15:59:05] !log mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=smnwiki --cluster=all [15:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:38] (03PS1) 10RobH: updating deploy1002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/635011 (https://phabricator.wikimedia.org/T265653) [16:01:43] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10RLazarus) Note October 27 is also scheduled for the MediaWiki datacenter switchback -- please let's not have both events going at the same time. :) The switchback is sched... [16:02:16] (03CR) 10RobH: [C: 03+2] updating deploy1002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/635011 (https://phabricator.wikimedia.org/T265653) (owner: 10RobH) [16:03:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['deploy1002.eqiad.wmnet'] ` The log can be found in `/var/l... 
[16:04:23] (03CR) 10Urbanecm: [C: 03+1] "per Huji, LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [16:05:51] (03PS4) 10Elukey: Assing role::analytics_cluster::coordinator::query to an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) [16:16:26] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [16:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:22] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:26] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10Reedy) I'm unable to do #sre-access-requests https://phabricator.wikimedia.org/project/members/956/ [16:18:34] 10Operations, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 2 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10MGerlach) >>! In T258978#6532612, @Joe wrote: > - Logging: log in `json format` to stdout Added... 
[16:18:49] (03PS1) 10Alexandros Kosiaris: Add a couple of metrics [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/635012 [16:19:56] (03PS1) 10Elukey: Add fake secrets for an-coord1002 [labs/private] - 10https://gerrit.wikimedia.org/r/635013 [16:21:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/635012 (owner: 10Alexandros Kosiaris) [16:22:25] (03PS2) 10Elukey: Add fake secrets for an-coord1002 [labs/private] - 10https://gerrit.wikimedia.org/r/635013 [16:23:39] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake secrets for an-coord1002 [labs/private] - 10https://gerrit.wikimedia.org/r/635013 (owner: 10Elukey) [16:23:50] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10MoritzMuehlenhoff) I've removed Chase from SRE-Access-Requests [16:24:05] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10MoritzMuehlenhoff) [16:24:24] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10sbassett) [16:24:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['deploy1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['deploy1002.eqiad.wmnet'] ` [16:25:07] 10Operations, 10Security-Team: Offboard Chase Pettet from Security Team - https://phabricator.wikimedia.org/T265147 (10sbassett) Thanks, @MoritzMuehlenhoff. I think we just have the two open subtasks left and then we can close this out. 
[16:25:14] (03PS1) 10Andrew Bogott: wmcs instance backup: move a few more projects to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/635014 (https://phabricator.wikimedia.org/T260692) [16:25:27] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, 10Patch-For-Review: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10Jdforrester-WMF) [16:25:42] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10Jdforrester-WMF) [16:26:32] (03PS5) 10Elukey: Assing role::analytics_cluster::coordinator::query to an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) [16:26:33] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10sbassett) @Krd - [[ https://phabricator.wikimedia.org/project/members/4570/ | you are now added ]]. Resolving this task for... 
[16:26:49] (03PS2) 10Andrew Bogott: wmcs instance backup: move a few more projects to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/635014 (https://phabricator.wikimedia.org/T260692) [16:28:02] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10sbassett) 05Stalled→03Resolved [16:28:05] (03CR) 10Andrew Bogott: [C: 03+2] wmcs instance backup: move a few more projects to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/635014 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [16:28:36] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/25985/" [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:29:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/635012 (owner: 10Alexandros Kosiaris) [16:29:51] 10Operations, 10SRE-Access-Requests: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Reedy) [16:30:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) 05Stalled→03Open [16:31:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [16:31:13] (03PS3) 10CRusnov: netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) [16:31:15] (03PS2) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [16:31:41] (03CR) 10jerkins-bot: [V: 04-1] netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 
(https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:31:43] (03CR) 10jerkins-bot: [V: 04-1] netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [16:32:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['deploy1002.eqiad.wmnet'] ` The log can be found in `/var/l... [16:38:05] 10Operations, 10SRE-Access-Requests, 10Security-Team: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Reedy) [16:38:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['deploy1002.eqiad.wmnet'] ` and were **ALL** successful. [16:40:47] (03CR) 10Huji: [C: 03+1] "Will you kindly see through it being deployed?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [16:44:20] !log robh@cumin1001 START - Cookbook sre.dns.netbox [16:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:24] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:54:35] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:56:41] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:59:36] (03CR) 10DCausse: "this will stop triggering mlr for simple queries with keyword, I think this is acceptable but we can double check in the logs to see how m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634992 (owner: 10DCausse) [16:59:49] (03CR) 10Ebernhardson: [C: 03+1] "This does seem more correct than before" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634992 (owner: 10DCausse) [17:00:04] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T1700). [17:02:09] 10Operations, 10netops, 10observability, 10Security, 10User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (10herron) What are the downsides to using iptables rules for this? Also, what are thoughts about creating a generalized iptables "skiplog" chain, and adding... 
[17:02:37] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Maryana Pinchuk - https://phabricator.wikimedia.org/T265555 (10Maryana) Oops – thanks, @Urbanecm! Logged in through Maryana and it worked. In other news, I have way too many WMF accounts ;) [17:08:55] 10Operations, 10netops, 10observability, 10Security, 10User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (10ayounsi) Downside of having a special iptables rule for diffscan's IP is that it will not be treated anymore as a "random external host" and could cause fa... [17:09:23] (03CR) 10Jforrester: [C: 03+1] wikitech.php: Set CURLOPT_RETURNTRANSFER true in gerrit handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634663 (https://phabricator.wikimedia.org/T242554) (owner: 10Reedy) [17:09:45] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (10AntiCompositeNumber) {T154237} has indicated that this update may introduce a regression in the handling of some languages in -translated SVGs. 
[17:12:19] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4031 is OK: HTTP OK: HTTP/1.0 200 OK - 23507 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:12:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:26:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:26:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:29:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) Please note that FB provided a circuitID for the other peering connection but not this one, so its entry for the circuit is N/A. I noticed that we have other circ... 
[17:30:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:32:07] (03CR) 10Urbanecm: [C: 03+1] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [17:32:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:33:17] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) [17:34:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch in FB peering into cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T265916 (10RobH) [17:34:38] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10RobH) [17:35:01] 10Operations, 10Domains, 10Education-Program-Dashboard, 10Traffic: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (10Ladsgroup) @Ragesoss requested this (look at the above comments) and he's CTO of wiki edu foundation (the org maintaining the dashboard) [17:50:11] (03PS4) 10CRusnov: netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) [17:50:13] (03PS3) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [17:50:37] (03CR) 10jerkins-bot: [V: 04-1] netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:53:17] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:27] jouncebot: next [17:56:27] In 0 hour(s) and 3 minute(s): Morning 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T1800) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T1800) [18:00:04] RoanKattouw and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] \o/ [18:00:16] I'll deploy today [18:00:38] (03CR) 10Catrope: [C: 03+2] Fix mobile diff redirect when 'curid' query parameter is present [extensions/MobileFrontend] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634802 (https://phabricator.wikimedia.org/T265654) (owner: 10Bartosz Dziewoński) [18:01:00] hi [18:03:25] (03PS4) 10Catrope: Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [18:03:31] (03CR) 10Catrope: [C: 03+2] Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [18:04:11] (03Merged) 10jenkins-bot: Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) (owner: 10Esanders) [18:04:38] MatmaRex: Your config patch is on mwdebug2001, please test [18:05:38] (03PS4) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [18:05:49] looking [18:06:05] (03CR) 10jerkins-bot: [V: 04-1] netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [18:06:37] (03CR) 10Alex Paskulin: "Is this a fix for https://phabricator.wikimedia.org/T265920?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [18:06:41] RoanKattouw: i am getting errors about readonly mode [18:07:02] (trying to view my preferences) [18:07:18] oh, i was on the wrong debug server. never mind [18:07:26] (03CR) 10Ppchelko: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [18:07:40] 2001, not 1001 [18:07:43] Yeah exactly [18:07:52] one more week guys [18:08:03] then it should be back with the switchover i assume [18:08:15] seems good [18:08:44] i'm convinced i changed it to 2001 in the browser extension's dropdown, but it didn't take. weird [18:10:56] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Drop wgHiddenPrefs hack for VE beta feature (T254349) (duration: 00m 56s) [18:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:02] T254349: A default install of MW+VE still shows VE in beta features and defaults the user option to disabled - https://phabricator.wikimedia.org/T254349 [18:11:08] mutante: I'm afraid of that day. I'll have to re-train my muscle memory again. [18:11:11] (03CR) 10Alex Paskulin: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/634841 (owner: 10Ppchelko) [18:11:29] MatmaRex If you don't mind, I'll do my config changes first, while CI still works on your other patch [18:12:01] (03PS2) 10Catrope: GrowthExperiments: Enable variant C/D for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634117 (https://phabricator.wikimedia.org/T265556) [18:12:07] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Enable variant C/D for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634117 (https://phabricator.wikimedia.org/T265556) (owner: 10Catrope) [18:12:26] np [18:13:49] (03PS5) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [18:14:27] (03Merged) 10jenkins-bot: GrowthExperiments: Enable variant C/D for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634117 (https://phabricator.wikimedia.org/T265556) (owner: 10Catrope) [18:17:28] (03CR) 10Volans: [C: 04-1] "svc should not be migrated at this time" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [18:20:40] (03Merged) 10jenkins-bot: Fix mobile diff redirect when 'curid' query parameter is present [extensions/MobileFrontend] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634802 (https://phabricator.wikimedia.org/T265654) (owner: 10Bartosz Dziewoński) [18:20:55] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Enable variant C/D for new users (T265556) (duration: 00m 56s) [18:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:00] T265556: Variant tests: roll out variant C/D - https://phabricator.wikimedia.org/T265556 [18:22:08] MatmaRex: Your MobileFrontend patch is on mwdebug2001, please test [18:22:30] (03PS2) 10Catrope: Enable and configure GrowthExperiments on trwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/634119 (https://phabricator.wikimedia.org/T243445) [18:22:37] (03CR) 10Catrope: [C: 03+2] Enable and configure GrowthExperiments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634119 (https://phabricator.wikimedia.org/T243445) (owner: 10Catrope) [18:22:44] RoanKattouw: looks good [18:22:55] Urbanecm: :) for maintenance servers the muscle memory problem could be solved with "ssh mwmaint.discovery.wmnet", there is just no mwdebug.discovery.wmnet like that [18:23:26] (03Merged) 10jenkins-bot: Enable and configure GrowthExperiments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634119 (https://phabricator.wikimedia.org/T243445) (owner: 10Catrope) [18:23:30] mutante: ohoho, we have discovery URI for that. Thanks! [18:23:36] * Urbanecm fixes his swat-prepare script [18:24:03] mutante: is there a way how to get the active DC somehow? [18:24:16] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/MobileFrontend/: Fix mobile diff redirect when curid parameter is present (T265654) (duration: 00m 58s) [18:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:22] T265654: Accessing diff pages from the watchlist points to revision page rather than diff page as curid parameter is not unset - https://phabricator.wikimedia.org/T265654 [18:25:01] Urbanecm: modules/profile/templates/scap/dsh-mediawiki-canaries.tpl.erb:{{- $active_dc := (json (getv "/mediawiki-config/common/WMFMasterDatacenter")).val }} [18:25:11] so in mw-config repo [18:25:17] WMFMasterDatacenter [18:25:21] mutante: mwmaint.discovery.wmnet still points to...mwmaint1002 :-( [18:25:25] (03CR) 10Huji: [C: 03+1] "SecurePoll on local DBs is not supported by WMF (long story). We must use it on votewiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [18:27:20] Urbanecm: hmm.. yes.. so the thing here is it does more than one thing. 
one is "are mw periodic jobs running here" and the other is "is this the webserver for noc.wikimedia.org" [18:27:34] one was switched and the other was not.. room for improvement there [18:28:11] (03PS1) 10Andrew Bogott: cloudvirt1019/1020: Move to Buster on next reimage [puppet] - 10https://gerrit.wikimedia.org/r/635023 (https://phabricator.wikimedia.org/T263677) [18:29:06] :( [18:29:33] !log removing 10 files for legal compliance [18:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:34] RoanKattouw: can you please ping me once it's fine to deploy a config for me? [18:31:04] Urbanecm: After this sync that I just started [18:31:10] thanks [18:31:41] (03PS5) 10CRusnov: netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) [18:31:43] (03PS6) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [18:31:48] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable and configure GrowthExperiments on trwiki (T243445) (duration: 00m 57s) [18:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:54] T243445: Deploy Growth features on Turkish Wikipedia - https://phabricator.wikimedia.org/T243445 [18:32:10] (03PS3) 10Urbanecm: Change votewiki language temporarily to fa for fawiki elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [18:32:17] (03CR) 10Urbanecm: [C: 03+2] Change votewiki language temporarily to fa for fawiki elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [18:32:59] (03Merged) 10jenkins-bot: Change votewiki language temporarily to fa for fawiki elections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 
(https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [18:33:26] (03CR) 10CRusnov: "Fixed svc addresses." (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [18:33:28] (03PS3) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [18:33:50] (03CR) 10Urbanecm: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [18:37:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 18902aa75efafb7d56ca347c12781dbe59f2f8ad: Change votewiki language temporarily to fa for fawiki elections (T262689) (duration: 00m 56s) [18:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:35] T262689: Carry out the 2020 fawiki elections on votewiki - https://phabricator.wikimedia.org/T262689 [18:37:55] RoanKattouw: thanks, that's all from me [18:39:57] I also have another config patch coming [18:40:36] 10Operations, 10serviceops: improve mw maintenance server switch over and discovery names - https://phabricator.wikimedia.org/T265936 (10Dzahn) [18:43:33] (03PS1) 10Ebernhardson: Revert "cirrus: temporarily disable saneitizer" [puppet] - 10https://gerrit.wikimedia.org/r/635047 (https://phabricator.wikimedia.org/T263073) [18:44:44] 10Operations, 10Wikimedia-Mailing-lists: Wrong encoding on mailman - https://phabricator.wikimedia.org/T265937 (10Ash_Crow) [18:46:28] (03PS1) 10Catrope: GrowthExperiments: Set default variant to D on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635048 (https://phabricator.wikimedia.org/T243445) [18:52:01] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Set default variant to D on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635048 (https://phabricator.wikimedia.org/T243445) (owner: 
10Catrope) [18:52:46] (03Merged) 10jenkins-bot: GrowthExperiments: Set default variant to D on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635048 (https://phabricator.wikimedia.org/T243445) (owner: 10Catrope) [18:54:24] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [software/heptiolabs/eventrouter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/635049 [18:54:27] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [software/heptiolabs/eventrouter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/635049 (owner: 10QChris) [18:56:49] PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:16] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Set default variant to D on trwiki (T243445, T265556) (duration: 00m 56s) [19:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:24] T265556: Variant tests: roll out variant C/D - https://phabricator.wikimedia.org/T265556 [19:01:26] T243445: Deploy Growth features on Turkish Wikipedia - https://phabricator.wikimedia.org/T243445 [19:06:49] (03CR) 10Dzahn: "role(insetup) was introduced after role(spare::system) already existed though.. let me check what the current workflow page actually says." [puppet] - 10https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [19:07:26] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10RobH) 05Open→03Resolved [19:08:15] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10RobH) 05Resolved→03Open a:05RobH→03Cmjohnson I shouldn't have resolved, hostname label has to go on. 
@Cmjohnson: Once the hostname label is applied to deploy1002, this can be r... [19:10:50] (03CR) 10Dzahn: "yea, so the server lifecycle page actually doesn't mention the specific roles. I think based on your second comment about Hiera and that i" [puppet] - 10https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [19:17:29] (03Abandoned) 10Dzahn: site: remove backup host role from helium [puppet] - 10https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [19:18:47] 10Operations, 10Wikimedia-Mailing-lists: Wrong encoding on mailman - https://phabricator.wikimedia.org/T265937 (10Quiddity) [19:19:18] 10Operations, 10Wikimedia-Mailing-lists, 10I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Quiddity) [19:26:46] (03PS1) 10Dzahn: conftool-data: switch codfw parsoid canary servers [puppet] - 10https://gerrit.wikimedia.org/r/635055 (https://phabricator.wikimedia.org/T265558) [19:29:07] !log dzahn@cumin1001 conftool action : set/weight=0; selector: dc=codfw,cluster=parsoid,service=canary [19:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:51] RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:02] (03CR) 10Ottomata: [C: 03+2] helmfile.d: refactor eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634976 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [19:33:39] !log wtp2001 - sudo confctl decommission [19:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:25] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[19:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [19:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:27] (03PS6) 10Ottomata: helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) [19:41:20] (03PS1) 10Jgiannelos: Update mobileapps to 2020-10-15-140655-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/635059 [19:41:49] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@3c590e2]: Fix column mismatch for discovery.wikibase_item and multilist handler for esbulk uploads [19:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:06] (03CR) 10Ottomata: [C: 03+2] helmfile.d: refactor eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/634983 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [19:45:25] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@3c590e2]: Fix column mismatch for discovery.wikibase_item and multilist handler for esbulk uploads (duration: 03m 35s) [19:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:51] (03PS1) 10Ottomata: eventgate-analytics-external - set staging replicas to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635062 (https://phabricator.wikimedia.org/T258572) [19:47:35] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics-external - set staging replicas to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635062 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [19:51:24] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:34] (03PS3) 10Ottomata: helmfile.d: refactor eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/634984 (https://phabricator.wikimedia.org/T258572) [19:51:38] (03CR) 10Jgiannelos: [C: 03+2] Update mobileapps to 2020-10-15-140655-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/635059 (owner: 10Jgiannelos) [19:52:27] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:52:27] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:13] (03Merged) 10jenkins-bot: Update mobileapps to 2020-10-15-140655-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/635059 (owner: 10Jgiannelos) [19:57:03] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:57:04] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:37] !log jgiannelos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [19:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T2000). 
[20:00:04] (03CR) 10Ottomata: [C: 03+2] helmfile.d: refactor eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/634984 (https://phabricator.wikimedia.org/T258572) (owner: 10Ottomata) [20:00:43] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,name=wtp200[1-9].codfw.wmnet [20:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:47] !log decom'ing wtp200[1-9].codfw.wmnet (pooled=inactive) T265558 [20:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:53] T265558: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 [20:02:54] (03CR) 10Dzahn: [C: 03+2] conftool-data: switch codfw parsoid canary servers [puppet] - 10https://gerrit.wikimedia.org/r/635055 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [20:04:46] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Andrew) [20:05:45] PROBLEM - mediawiki-installation DSH group on wtp2003 is CRITICAL: Host wtp2003 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:06:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:06:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:48] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [20:08:48] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[20:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:18] (03PS1) 10Ottomata: eventstreams - fix schema URIs in stream docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/635067 [20:09:58] !log dzahn@cumin1001 conftool action : set/weight=1; selector: dc=codfw,cluster=parsoid,service=canary [20:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:30] (03PS1) 10Andrew Bogott: cloud-vps snuggle project: remove a couple of NFS mounts [puppet] - 10https://gerrit.wikimedia.org/r/635068 (https://phabricator.wikimedia.org/T102680) [20:10:59] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps snuggle project: remove a couple of NFS mounts [puppet] - 10https://gerrit.wikimedia.org/r/635068 (https://phabricator.wikimedia.org/T102680) (owner: 10Andrew Bogott) [20:11:18] (03CR) 10Dzahn: "set weight to 1 as usually done for canaries" [puppet] - 10https://gerrit.wikimedia.org/r/635055 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [20:11:41] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams - fix schema URIs in stream docs [deployment-charts] - 10https://gerrit.wikimedia.org/r/635067 (owner: 10Ottomata) [20:12:06] (03PS5) 10Dzahn: parsoid: add data types [puppet] - 10https://gerrit.wikimedia.org/r/634385 [20:15:23] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@e66bec2]: Fix column mismatch when reading discovery.wikibase_item [20:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:29] (03CR) 10Huji: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626851 (https://phabricator.wikimedia.org/T262689) (owner: 10Majavah) [20:15:44] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,name=wtp201[0-9].codfw.wmnet [20:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:15:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:16:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:10] !log decom'ing wtp201[0-9].codfw.wmnet (pooled=inactive) T265558 [20:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:15] T265558: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 [20:16:26] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@e66bec2]: Fix column mismatch when reading discovery.wikibase_item (duration: 01m 03s) [20:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:16:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:58] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,name=wtp2020.codfw.wmnet [20:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:27] !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:17:27] !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [20:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:28] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[20:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:35] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [20:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:02] 10Operations, 10Scap, 10serviceops, 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (10jijiki) [20:19:44] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [20:19:44] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [20:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:39] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25986/" [puppet] - 10https://gerrit.wikimedia.org/r/634385 (owner: 10Dzahn) [20:21:49] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [20:21:49] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [20:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:48] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) @kostajh Thank you! 
💃🏼 [20:24:05] (03PS2) 10Dzahn: conftool-data: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634126 (https://phabricator.wikimedia.org/T265558) [20:24:53] (03CR) 10Dzahn: [C: 03+2] conftool-data: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634126 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [20:25:46] (03PS1) 10Ppchelko: Enable warn+ logging for ParserCache channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) [20:26:35] (03PS2) 10Ppchelko: Enable warn+ logging for ParserCache channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) [20:29:10] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [20:29:42] (03PS2) 10Dzahn: site: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634362 (https://phabricator.wikimedia.org/T265558) [20:38:27] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25987/" [puppet] - 10https://gerrit.wikimedia.org/r/634362 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [20:38:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [20:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:01] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Development services): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (10hashar) Gitlab is irrelevant, we still have to maintain Gerrit and will have to maintain it for as... 
[20:46:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:38] (03PS1) 10Ppchelko: Api-Gateway: use HTTPS in staging for upstream services [deployment-charts] - 10https://gerrit.wikimedia.org/r/635075 (https://phabricator.wikimedia.org/T265638) [20:49:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [20:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:22] (03CR) 10Ppchelko: [C: 03+2] Api-Gateway: use HTTPS in staging for upstream services [deployment-charts] - 10https://gerrit.wikimedia.org/r/635075 (https://phabricator.wikimedia.org/T265638) (owner: 10Ppchelko) [20:53:02] (03Merged) 10jenkins-bot: Api-Gateway: use HTTPS in staging for upstream services [deployment-charts] - 10https://gerrit.wikimedia.org/r/635075 (https://phabricator.wikimedia.org/T265638) (owner: 10Ppchelko) [20:55:55] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [20:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T2100). [21:01:12] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[21:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:07] (03PS1) 10Dzahn: add deploy1002 to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/635079 [21:04:54] 10Operations, 10MediaWiki-General, 10Platform Engineering: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (10AMooney) a:03tstarling [21:14:31] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@94c23a1]: airflow: fix column mismatch writing page predictions [21:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:19] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@94c23a1]: airflow: fix column mismatch writing page predictions (duration: 04m 48s) [21:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [21:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:04] (03PS1) 10Dzahn: remove wtp2001 through wtp2009 [dns] - 10https://gerrit.wikimedia.org/r/635083 (https://phabricator.wikimedia.org/T265558) [21:26:32] (03CR) 10Jeena Huneidi: [DNM] Experimental King helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/634354 (https://phabricator.wikimedia.org/T258572) (owner: 10Jeena Huneidi) [21:28:01] (03PS1) 10BryanDavis: toolviews: Fix logic bug affecting toolforge.org vhosts [puppet] - 10https://gerrit.wikimedia.org/r/635084 (https://phabricator.wikimedia.org/T265949) [21:29:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [21:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:12] (03CR) 10Dzahn: [C: 03+2] remove wtp2001 through wtp2009 [dns] - 10https://gerrit.wikimedia.org/r/635083 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [21:40:49] (03PS1) 10Ppchelko: Enable parsoid on api_appserver 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) [21:41:10] (03PS2) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) [21:42:17] (03CR) 10jerkins-bot: [V: 04-1] Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [21:43:07] (03PS3) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) [21:45:10] (03CR) 10Bstorm: [C: 03+2] toolviews: Fix logic bug affecting toolforge.org vhosts [puppet] - 10https://gerrit.wikimedia.org/r/635084 (https://phabricator.wikimedia.org/T265949) (owner: 10BryanDavis) [21:47:03] (03PS1) 10Dzahn: etherpad: explicitly use the colibris skin [puppet] - 10https://gerrit.wikimedia.org/r/635087 [21:48:19] (03PS2) 10Dzahn: etherpad: explicitly use the colibris skin and default variant [puppet] - 10https://gerrit.wikimedia.org/r/635087 [21:53:07] 10Operations, 10SRE-Access-Requests, 10Security-Team: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) a:03Dzahn [21:55:00] 10Operations, 10SRE-Access-Requests, 10Security-Team: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) Yes, SRE has root access to it. There is no other admin role applied though. 
[21:56:42] (03CR) 10Subramanya Sastry: Enable parsoid on api_appserver (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:00:06] (03PS1) 10Dzahn: allow secteam-users access to nodes with role(peek) [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) [22:01:11] (03PS2) 10Dzahn: allow secteam-users access to nodes with role(peek) [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) [22:03:20] PROBLEM - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.47 and port 4101: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:03:25] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:32] here [22:03:33] (03PS1) 10Dzahn: etherpad: activate shortcut keys [puppet] - 10https://gerrit.wikimedia.org/r/635091 [22:03:48] is that the 22:00 page, back unexpectedly? 
checking [22:03:55] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/re [22:03:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected st [22:03:59] ng: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [22:04:26] (03CR) 10Ppchelko: Enable parsoid on api_appserver (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:04:38] RECOVERY - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. 
termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1003 bytes in 1.225 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:04:39] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:06] (03CR) 10Ppchelko: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:05:09] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:15] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:05:16] rzl: I sent the "resolved" in VP [22:05:23] spike in incoming requests for v1_page_summary_-title- https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&from=now-3h&to=now [22:05:36] which is consistent with the midnight iOS issue but weird that it wasn't cacheable this time [22:06:00] well, that's conjecture -- weird that the cache didn't protect us, for any reason including possibly that one :) [22:06:07] mutante: ack thanks [22:07:54] looks like just the "from internal" rate went up, not from external clients [22:15:11] not sure what that internal/external distinction is, but this was definitely a spike from the iOS app [22:15:39] ack [22:16:25] (03PS1) 10Dzahn: etherpad: add 'trustProxy' config setting and enable it [puppet] - 10https://gerrit.wikimedia.org/r/635094 [22:16:39] (03CR) 10Subramanya Sastry: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:17:58] https://logstash.wikimedia.org/goto/7e7057052d0a19de13b8a15256761998 [22:18:40] some resets and some 
timeouts, but all for root_req.uri = /${some_language}.wikipedia.org/v1/page/random/summary [22:20:02] so it might be the traffic changed and our rate limiting on /random/ is no longer catching it (easy to check, will do shortly) or it might be the rate limit is set too high, and it was still enough to page [22:20:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:20:13] (although in that case I'd wonder why it was fine for several days in a row and paged today) [22:20:34] (03PS1) 10Ppchelko: Enable Parsoid REST API when loading it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635095 (https://phabricator.wikimedia.org/T265295) [22:20:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:21:34] checking if that is planned maintenance ^ [22:22:09] hmmmm, like I say it's coming in as /de.wikipedia.org/v1/page/random/summary but our rate limit checks for req.url == "/api/rest_v1/page/random/summary" [22:22:21] so it's possible that's what changed, I need to see if that's how it was before [22:22:53] it's also coming in with a "content-location: https://de.wikipedia.org/api/rest_v1/page/random/summary" header [22:23:26] oh I bet that's just the difference between the request urls as seen by varnish and restbase? [22:23:28] I need an adult [22:26:01] (03CR) 10Ppchelko: "Before creating a setting, disabled by default, we need to enable it in the config where needed. This is a no-op." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/635095 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:27:06] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn Lumen Scheduled Maintenance #: 19722045, Scheduled https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:27:06] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn Lumen Scheduled Maintenance #: 19722045, Scheduled https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:27:27] (03PS4) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) [22:27:34] yes, I was able to find an entry in the maint announce calendar matching that router interface going down [22:28:14] it is eqiad-esams Level 3 10Gbps wave [22:28:22] now called Lumen [22:28:23] (03CR) 10Ppchelko: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:28:37] 👍 [22:29:24] still not sure what changed here -- it's not urgent, the actual production harm ended with the spike, but we might keep getting paged at 22:00 daily until it's addressed :) so I'd like to figure it out [22:30:06] rzl: what you said looked for a moment like "v1" vs "rest_v1" as part of the URL [22:30:25] besides the domain prefix that is [22:31:28] yeah, I'm not sure if that's just a URL change happening within our stack -- rewrite in the cache layer before it's forwarded to restbase, most likely [22:31:37] I just wasn't the one digging into this last time so it's taking me a moment to catch up :) [22:37:48] okay yeah, from turnilo it looks like it's exactly
that, so the URL difference was a red herring [22:40:30] (03CR) 10Subramanya Sastry: "Given this is close to next train release, we decided on IRC that it is simplest to wait a week to get all the pieces lined up. So, Petr w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265295) (owner: 10Ppchelko) [22:42:30] (03PS1) 10Dzahn: etherpad: add commitRateLimiting config, set duration 10, points 100 [puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) [22:42:48] (03CR) 10jerkins-bot: [V: 04-1] etherpad: add commitRateLimiting config, set duration 10, points 100 [puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [22:42:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:44:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:44] (03PS2) 10Dzahn: etherpad: add commitRateLimiting config, set higher values [puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) [22:46:04] (03CR) 10jerkins-bot: [V: 04-1] etherpad: add commitRateLimiting config, set higher values [puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [22:47:10] (03PS3) 10Dzahn: etherpad: add commitRateLimiting config, set higher values [puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) [22:48:14] (03PS2) 10Ppchelko: Enable Parsoid REST API when loading it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635095 (https://phabricator.wikimedia.org/T265954) [22:48:16] (03PS5) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) [22:53:47] (03CR) 10Cwhite: [C: 03+1] profile: add alerts for Thanos sidecar not uploading or failing to do so [puppet] - 10https://gerrit.wikimedia.org/r/634475 (https://phabricator.wikimedia.org/T265632) (owner: 10Filippo Giunchedi) [22:59:14] (03CR) 10Dzahn: [C: 03+2] etherpad: add commitRateLimiting config, set higher values [puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201019T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:02:22] !log etherpad got restarted with new config options related to rate limiting - hopefully this fixed T265490 [23:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:31] T265490: rate limited etherpad - https://phabricator.wikimedia.org/T265490 [23:05:32] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) https://grafana.wikimedia.org/d/000000193/etherpad?viewPanel=16&orgId=1&from=now-24h&to=now ^ Uhm.. but this went up. Trying to manually reproduce I could not... also before... [23:07:15] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@4bfd6c9]: spark: case insensitive schema validation [23:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:48] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@4bfd6c9]: spark: case insensitive schema validation (duration: 04m 33s) [23:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:04] (03PS2) 10Dzahn: etherpad: add 'trustProxy' config setting and enable it [puppet] - 10https://gerrit.wikimedia.org/r/635094 (https://phabricator.wikimedia.org/T265490) [23:31:53] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) @hashar Please let me know if you still see the issue or not. I am hopeful it might be fixed and can't reproduce it right now but also it seems we only had spikes during meeting... 
[23:37:46] (03CR) 10Dzahn: docker::registry: hiera->lookup, add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/633835 (owner: 10Dzahn) [23:39:00] (03PS2) 10Dzahn: docker::registry: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633835 [23:40:57] (03PS2) 10Dzahn: puppetmaster: pass $servers parameter to gitclone class [puppet] - 10https://gerrit.wikimedia.org/r/634368 [23:42:01] (03CR) 10Dzahn: puppetmaster: pass $servers parameter to gitclone class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634368 (owner: 10Dzahn) [23:42:18] (03PS3) 10Dzahn: puppetmaster: pass $servers parameter to gitclone class [puppet] - 10https://gerrit.wikimedia.org/r/634368 [23:42:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:43:45] (03PS2) 10Dzahn: mariadb::grants: hiera()->lookup() [puppet] - 10https://gerrit.wikimedia.org/r/634387 (https://phabricator.wikimedia.org/T256972) [23:44:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:36] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10Dzahn) [23:53:04] (03PS2) 10Dzahn: add deploy1002 to deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) [23:56:05] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [23:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:20] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [23:57:21] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [23:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:40] (03PS3) 10Dzahn: add deploy1002 to deployment_hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) [23:57:43] (03PS1) 10Dzahn: hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) [23:57:44] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [23:57:44] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [23:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log