[00:02:23] (03PS1) 10Urbanecm: Add redirect from otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/692454 (https://phabricator.wikimedia.org/T280400) [00:02:37] (03CR) 10Urbanecm: [C: 04-1] "not now" [puppet] - 10https://gerrit.wikimedia.org/r/692454 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [00:04:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:22] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10odimitrijevic) [00:45:45] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:49:18] (03CR) 10Krinkle: [C: 03+1] "LGTM, but should be thoroughly tested on mwdebug both a wiki where the override exists and one where it does not." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/692413 (owner: 10Zabe) [01:11:06] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10ttaylor) [01:15:56] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10ttaylor) Approved by me [01:37:05] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm) [02:07:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.37.0-wmf.6 [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692464 [02:07:57] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.37.0-wmf.6 [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692464 (owner: 10TrainBranchBot) [02:24:30] (03CR) 10jerkins-bot: [V: 04-1] Branch commit for wmf/1.37.0-wmf.6 [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692464 (owner: 10TrainBranchBot) [03:06:03] (03PS1) 10Razzi: yarn: temporarily stop allowing jobs to be submitted to yarn [puppet] - 10https://gerrit.wikimedia.org/r/692465 (https://phabricator.wikimedia.org/T278423) [03:46:31] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (8741) = 92.2% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [04:51:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:56:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:06:36] !log Restart db1115 mysql [05:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:44] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Marostegui) >>! In T282621#7094350, @Legoktm wrote: > How about 2021-05-19 06:00 UTC? Or any other day at that time > Works for me! [05:08:44] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Marostegui) p:05Triage→03Medium [05:09:01] RECOVERY - MariaDB memory on db1115 is OK: OK Memory 2% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P16028 and previous config saved to /var/cache/conftool/dbconfig/20210518-050949-marostegui.json [05:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:19] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 112140768 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:14:41] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 382568 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:18:26] (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692470 (https://phabricator.wikimedia.org/T280492) [05:19:14] (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692470 (https://phabricator.wikimedia.org/T280492) 
(owner: 10Marostegui) [05:22:10] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Archive destacado-l - https://phabricator.wikimedia.org/T282291 (10Ladsgroup) 05Open→03Resolved [05:23:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114', diff saved to https://phabricator.wikimedia.org/P16029 and previous config saved to /var/cache/conftool/dbconfig/20210518-052324-marostegui.json [05:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1106.eqiad.wmnet with reason: REIMAGE [05:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1106.eqiad.wmnet with reason: REIMAGE [05:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:47] (03PS1) 10Marostegui: db2108: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692471 (https://phabricator.wikimedia.org/T282535) [05:28:29] (03CR) 10Marostegui: [C: 03+2] db2108: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692471 (https://phabricator.wikimedia.org/T282535) (owner: 10Marostegui) [05:32:27] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change, 10User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm) Announced: https://lists.wikimedia.org/hyperkitty/list/listadmins-announce@lists.wikimedia.org/thread/THQY2OJYW5NZIBFS3OK... 
[05:38:22] (03PS1) 10Marostegui: mariadb: Decommission labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/692472 (https://phabricator.wikimedia.org/T282522) [05:41:50] (03CR) 10Ladsgroup: "Another batch :D" [puppet] - 10https://gerrit.wikimedia.org/r/692280 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [05:42:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts labsdb1009.eqiad.wmnet [05:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:49:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:52:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labsdb1009.eqiad.wmnet [05:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:56] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/692472 (https://phabricator.wikimedia.org/T282522) (owner: 10Marostegui) [05:54:33] 10ops-eqiad, 10decommission-hardware: decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10Marostegui) a:05Marostegui→03wiki_willy This is ready for #dc-ops [05:55:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P16030 and previous config saved to /var/cache/conftool/dbconfig/20210518-055522-root.json [05:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:53] 
10ops-eqiad, 10decommission-hardware: decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10Marostegui) a:05Marostegui→03wiki_willy [05:56:10] 10ops-eqiad, 10decommission-hardware: decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10Marostegui) a:05wiki_willy→03Marostegui [06:06:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 78 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:10:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P16031 and previous config saved to /var/cache/conftool/dbconfig/20210518-061026-root.json [06:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:14:04] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10jijiki) [06:15:51] (03CR) 10Marostegui: "I don't think we use this, but let's see if Ariel does." [puppet] - 10https://gerrit.wikimedia.org/r/692370 (owner: 10Zabe) [06:16:35] 10SRE, 10netops: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 (10ayounsi) p:05Triage→03Medium [06:23:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we should actually favour logging to stdout/stderr in the image, and possibly remove the env variable." 
(031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691293 (owner: 10Dzahn) [06:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P16032 and previous config saved to /var/cache/conftool/dbconfig/20210518-062529-root.json [06:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:29] (03PS1) 10Marostegui: instances.yaml: Remove db1083 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/692478 (https://phabricator.wikimedia.org/T281445) [06:29:10] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1083 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/692478 (https://phabricator.wikimedia.org/T281445) (owner: 10Marostegui) [06:29:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1083 from dbctl T281445', diff saved to https://phabricator.wikimedia.org/P16033 and previous config saved to /var/cache/conftool/dbconfig/20210518-062947-marostegui.json [06:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:52] T281445: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 [06:32:06] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:32:37] (03PS1) 10Marostegui: mariadb: Decommission db1083 [puppet] - 10https://gerrit.wikimedia.org/r/692479 (https://phabricator.wikimedia.org/T281445) [06:33:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1083.eqiad.wmnet [06:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P16034 and previous config saved to 
/var/cache/conftool/dbconfig/20210518-064033-root.json [06:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:23] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1083 [puppet] - 10https://gerrit.wikimedia.org/r/692479 (https://phabricator.wikimedia.org/T281445) (owner: 10Marostegui) [06:41:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1083.eqiad.wmnet [06:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:33] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui) a:05Marostegui→03wiki_willy [06:43:10] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui) Ready for #dc-ops [06:43:19] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:43:30] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10Marostegui) [06:44:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111', diff saved to https://phabricator.wikimedia.org/P16035 and previous config saved to /var/cache/conftool/dbconfig/20210518-064426-marostegui.json [06:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [06:47:30] (03CR) 10Ayounsi: [C: 03+2] cloudsw: manage OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/682956 (owner: 10Ayounsi) [06:47:56] (03Merged) 10jenkins-bot: Add canary support in scaffolding [deployment-charts] - 
10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [06:53:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [06:54:04] !log Homerify cloudsw ospf [06:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:33] (03Merged) 10jenkins-bot: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [06:56:08] (03PS5) 10Ayounsi: cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 [07:05:58] !log Deploy schema change on s4 codfw, lag will appear in codfw T266486 T268392 T273360 [07:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:04] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 [07:06:04] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 [07:06:05] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 [07:06:10] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/692258 (owner: 10Muehlenhoff) [07:06:17] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for piwik [puppet] - 10https://gerrit.wikimedia.org/r/692258 (owner: 10Muehlenhoff) [07:09:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P16036 and previous config saved to /var/cache/conftool/dbconfig/20210518-070947-root.json [07:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:27] (03CR) 10Muehlenhoff: 
"Worked fine in my tests" [puppet] - 10https://gerrit.wikimedia.org/r/692258 (owner: 10Muehlenhoff) [07:17:16] (03CR) 10Ayounsi: [C: 03+2] cloudsw: policy-options [homer/public] - 10https://gerrit.wikimedia.org/r/682957 (owner: 10Ayounsi) [07:21:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:23:39] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/692262 (owner: 10Muehlenhoff) [07:24:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P16037 and previous config saved to /var/cache/conftool/dbconfig/20210518-072451-root.json [07:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:24] (03CR) 10ArielGlenn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/692370 (owner: 10Zabe) [07:30:28] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for JStephenson1980 - https://phabricator.wikimedia.org/T282521 (10ayounsi) @JStephenson Hello! There are 3 wikitech accounts matching your email :) > uid: jstep > sn: JStephenson > uid: jstep1980 > sn: Jstep > uid: jstep2021 > sn: JStephenson1980 W... 
[07:31:13] (03PS5) 10Ayounsi: cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 [07:34:14] (03CR) 10Ayounsi: [C: 03+2] cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi) [07:34:59] (03Merged) 10jenkins-bot: cloudsw: loopback firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/682972 (owner: 10Ayounsi) [07:39:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P16038 and previous config saved to /var/cache/conftool/dbconfig/20210518-073955-root.json [07:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:58] (03CR) 10Muehlenhoff: "This does its job (the session is no longer usable since the nodejs backend service can't make a connection), but could use some later UI" [puppet] - 10https://gerrit.wikimedia.org/r/692262 (owner: 10Muehlenhoff) [07:41:32] 10SRE, 10MW-on-K8s, 10serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) 05Open→03Resolved [07:43:39] (03PS1) 10Muehlenhoff: Enable SLO for yarn [puppet] - 10https://gerrit.wikimedia.org/r/692542 [07:46:12] 10SRE, 10MW-on-K8s, 10serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Joe) p:05Triage→03High [07:49:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692542 (owner: 10Muehlenhoff) [07:53:05] (03PS1) 10Muehlenhoff: Fix date format [puppet] - 10https://gerrit.wikimedia.org/r/692544 [07:54:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Repool db1111', diff saved to https://phabricator.wikimedia.org/P16039 and previous config saved to /var/cache/conftool/dbconfig/20210518-075458-root.json [07:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:55:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126', diff saved to https://phabricator.wikimedia.org/P16040 and previous config saved to /var/cache/conftool/dbconfig/20210518-075532-marostegui.json [07:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:53] (03CR) 10Muehlenhoff: [C: 03+2] Fix date format [puppet] - 10https://gerrit.wikimedia.org/r/692544 (owner: 10Muehlenhoff) [07:58:15] (03PS1) 10Muehlenhoff: Enable SLO for Hue [puppet] - 10https://gerrit.wikimedia.org/r/692545 [07:59:08] (03PS1) 10Ayounsi: cloudsw: small improvements to loopback filter [homer/public] - 10https://gerrit.wikimedia.org/r/692566 [08:00:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692545 (owner: 10Muehlenhoff) [08:02:18] (03PS1) 10Muehlenhoff: Enable SLO for Superset [puppet] - 10https://gerrit.wikimedia.org/r/692567 [08:04:18] (03PS1) 10Ayounsi: Aggregate Icmp6_InPktTooBigs by clusters [puppet] - 10https://gerrit.wikimedia.org/r/692568 [08:08:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692567 (owner: 10Muehlenhoff) [08:09:56] (03PS2) 10Ayounsi: Aggregate Icmp6_InPktTooBigs by clusters [puppet] - 10https://gerrit.wikimedia.org/r/692568 [08:13:17] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:15:23] (03CR) 10Ayounsi: [C: 03+2] cloudsw: small improvements to loopback filter [homer/public] - 10https://gerrit.wikimedia.org/r/692566 (owner: 10Ayounsi) [08:21:22] (03PS1) 10Muehlenhoff: Enable SLO for Graphite [puppet] - 10https://gerrit.wikimedia.org/r/692569 [08:23:54] (03PS1) 10Muehlenhoff: Enable SLO for librenms [puppet] - 10https://gerrit.wikimedia.org/r/692570 [08:24:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692569 (owner: 10Muehlenhoff) [08:24:26] (03CR) 
10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692570 (owner: 10Muehlenhoff) [08:31:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P16041 and previous config saved to /var/cache/conftool/dbconfig/20210518-083139-root.json [08:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:39] ACKNOWLEDGEMENT - dump of m5 in codfw on alert1001 is CRITICAL: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2021-05-18 04:19:12 is 33 GB, but previous one was 25 GB, a change of 33.0% Jcrespo mailman import - The acknowledgement expires at: 2021-05-25 06:38:50. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:40:39] ACKNOWLEDGEMENT - dump of m5 in eqiad on alert1001 is CRITICAL: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2021-05-18 03:02:14 is 33 GB, but previous one was 25 GB, a change of 34.6% Jcrespo mailman import - The acknowledgement expires at: 2021-05-25 06:38:50. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:44:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:44:35] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:45:16] (03PS1) 10Marostegui: check_private_data_report: Remove labsdb* support [puppet] - 10https://gerrit.wikimedia.org/r/692571 (https://phabricator.wikimedia.org/T282662) [08:46:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P16042 and previous config saved to /var/cache/conftool/dbconfig/20210518-084643-root.json [08:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:32] (03CR) 10Kormat: [C: 03+1] check_private_data_report: Remove labsdb* support [puppet] - 10https://gerrit.wikimedia.org/r/692571 (https://phabricator.wikimedia.org/T282662) (owner: 10Marostegui) [08:52:37] (03CR) 10Filippo Giunchedi: [C: 03+1] Aggregate Icmp6_InPktTooBigs by clusters [puppet] - 10https://gerrit.wikimedia.org/r/692568 (owner: 10Ayounsi) [08:52:46] (03CR) 10Elukey: "We may not need the extra istio/ namespace, I am checking helm charts and there seems to be a way instruct the istioctl manifest to use cu" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [08:52:49] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Remove labsdb* support [puppet] - 10https://gerrit.wikimedia.org/r/692571 (https://phabricator.wikimedia.org/T282662) (owner: 10Marostegui) [08:53:20] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable SLO for 
Graphite [puppet] - 10https://gerrit.wikimedia.org/r/692569 (owner: 10Muehlenhoff) [08:53:27] what's SLO? cc jbond42 moritzm [08:53:47] godog: Single Logout [08:53:58] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: Remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/692280 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:54:15] i.e. logging out at the idp should kill the session in all logged-in services [08:54:36] ah! got it, thanks jbond42 [08:55:06] I was puzzled for a second because surely idp configuration has nothing to do with service level objectives [08:55:44] (03PS1) 10Kormat: Revert "db1131: Disable notifications." [puppet] - 10https://gerrit.wikimedia.org/r/692560 [08:56:39] (03CR) 10Kormat: [C: 03+2] Revert "db1131: Disable notifications." [puppet] - 10https://gerrit.wikimedia.org/r/692560 (owner: 10Kormat) [08:58:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/692542 (owner: 10Muehlenhoff) [09:01:57] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove s6 eqiad primary from 'api' group T280751', diff saved to https://phabricator.wikimedia.org/P16043 and previous config saved to /var/cache/conftool/dbconfig/20210518-090156-kormat.json [09:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P16044 and previous config saved to /var/cache/conftool/dbconfig/20210518-090159-root.json [09:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:01] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [09:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16045 and previous config saved to 
/var/cache/conftool/dbconfig/20210518-090215-kormat.json [09:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:57] godog: ahh yes perhaps we should change it i originally had SSLout (single sign out) [09:04:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Set db1131 to weight 400 in s6/eqiad T280751', diff saved to https://phabricator.wikimedia.org/P16046 and previous config saved to /var/cache/conftool/dbconfig/20210518-090449-kormat.json [09:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:06] jbond42: heheh SSO-L ? starting to become nasa-level nomenclature [09:05:19] seriously though, I think it is fine [09:05:39] :D lol ack ok will leave as is for now [09:12:00] (03CR) 10Ayounsi: [C: 03+2] Aggregate Icmp6_InPktTooBigs by clusters [puppet] - 10https://gerrit.wikimedia.org/r/692568 (owner: 10Ayounsi) [09:13:46] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:15:27] 10SRE, 10User-Kormat: cumin: If no command is provided, output nodelist to stdout - https://phabricator.wikimedia.org/T261861 (10Volans) 05Open→03Resolved This has been implemented, released in Cumin 4.1.0 and deployed everywhere in production. See https://doc.wikimedia.org/cumin/master/release.html#v4-1-0... 
[09:17:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Repool db1126', diff saved to https://phabricator.wikimedia.org/P16047 and previous config saved to /var/cache/conftool/dbconfig/20210518-091702-root.json [09:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1178', diff saved to https://phabricator.wikimedia.org/P16048 and previous config saved to /var/cache/conftool/dbconfig/20210518-091717-marostegui.json [09:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:26] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16049 and previous config saved to /var/cache/conftool/dbconfig/20210518-091725-kormat.json [09:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:29] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [09:17:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/692545 (owner: 10Muehlenhoff) [09:18:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/692567 (owner: 10Muehlenhoff) [09:18:34] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/692569 (owner: 10Muehlenhoff) [09:27:10] !log push test SNMP filter config on asw-a-codfw - T283060 [09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:16] T283060: SNMP: filter out default sub interfaces - https://phabricator.wikimedia.org/T283060 [09:29:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:29:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 
198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:19] !log add peering sessions to AS8708 RCS & RDS on cr2-esams [09:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:20] (03CR) 10Filippo Giunchedi: wmflib::role_hosts: new function return list of hosts running a role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [09:32:29] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16050 and previous config saved to /var/cache/conftool/dbconfig/20210518-093228-kormat.json [09:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:33] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [09:33:26] (03CR) 10Elukey: "> Patch Set 19:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [09:34:04] (03PS1) 10Marostegui: instances.yaml: Remove db1087 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/692574 (https://phabricator.wikimedia.org/T282093) [09:35:06] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1087 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/692574 (https://phabricator.wikimedia.org/T282093) (owner: 10Marostegui) [09:35:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1087 from dbctl T282093', diff saved to https://phabricator.wikimedia.org/P16051 and previous config saved to /var/cache/conftool/dbconfig/20210518-093552-marostegui.json [09:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:56] T282093: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093 [09:40:57] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Repool db1178', diff saved to https://phabricator.wikimedia.org/P16052 and previous config saved to /var/cache/conftool/dbconfig/20210518-094056-root.json [09:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:30] Did I ever try that? [09:44:36] !log 👍 [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:55] niiice [09:46:30] impressive :) [09:47:33] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16053 and previous config saved to /var/cache/conftool/dbconfig/20210518-094732-kormat.json [09:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:37] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [09:56:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Repool db1178', diff saved to https://phabricator.wikimedia.org/P16054 and previous config saved to /var/cache/conftool/dbconfig/20210518-095600-root.json [09:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:22] (03PS1) 10Kormat: db1085: Disable notifications for decom [puppet] - 10https://gerrit.wikimedia.org/r/692575 (https://phabricator.wikimedia.org/T282096) [09:58:24] (03CR) 10Kormat: [C: 03+2] db1085: Disable notifications for decom [puppet] - 10https://gerrit.wikimedia.org/r/692575 (https://phabricator.wikimedia.org/T282096) (owner: 10Kormat) [09:58:56] 10SRE, 10netops: SNMP: filter out default sub interfaces - https://phabricator.wikimedia.org/T283060 (10ayounsi) Quick test shows an almost 50% improvement. https://librenms.wikimedia.org/graphs/type=device_poller_perf/device=95/from=1621310100/to=1621331700/ And a cleaner interfaces list. 
[10:03:16] !log stopping mariadb on db1085 T282096 [10:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:20] T282096: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 [10:06:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:07:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:08:23] (03PS1) 10Ayounsi: SNMP: filter out default logical interfaces (.0) [homer/public] - 10https://gerrit.wikimedia.org/r/692576 (https://phabricator.wikimedia.org/T283060) [10:09:03] (03CR) 10Elukey: "> Patch Set 19:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [10:09:50] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for Graphite [puppet] - 10https://gerrit.wikimedia.org/r/692569 (owner: 10Muehlenhoff) [10:11:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Repool db1178', diff saved to https://phabricator.wikimedia.org/P16055 and previous config saved to /var/cache/conftool/dbconfig/20210518-101104-root.json [10:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE [10:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:28] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE [10:16:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:25] (03CR) 10Muehlenhoff: "Works as expected (quite a bit of the presented data is cached client-side, but with the next access of a new dashboard one gets redirecte" [puppet] - 10https://gerrit.wikimedia.org/r/692569 (owner: 10Muehlenhoff) [10:24:39] (03PS1) 10Filippo Giunchedi: grafana: restore rsync::quickdatacopy between active/standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/692577 [10:25:54] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29604/console" [puppet] - 10https://gerrit.wikimedia.org/r/692577 (owner: 10Filippo Giunchedi) [10:26:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Repool db1178', diff saved to https://phabricator.wikimedia.org/P16056 and previous config saved to /var/cache/conftool/dbconfig/20210518-102607-root.json [10:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:55] !log installing OpenJDK updates on Hadoop/Druid/AQS/kafka-Jumbo [10:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:42] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10elukey) [10:30:44] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10elukey) Verified that the L3 document has been signed, all checks marked, we can proceed. [10:31:07] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10elukey) @odimitrijevic have you created the developer account? 
https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedi... [10:31:45] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:32:48] yeah that's me [10:33:07] going to lunch, should fix itself shortly [10:33:41] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 302 Found - 435 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:34:58] (03PS1) 10Muehlenhoff: Extend Cumin alias for AQS [puppet] - 10https://gerrit.wikimedia.org/r/692578 [10:39:37] (03CR) 10Volans: [C: 03+1] "Syntax wise is ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/692578 (owner: 10Muehlenhoff) [10:41:38] (03PS1) 10Elukey: admin: add user oljad account and groups [puppet] - 10https://gerrit.wikimedia.org/r/692579 (https://phabricator.wikimedia.org/T282836) [10:51:45] (03PS1) 10Marostegui: pc1010: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692581 (https://phabricator.wikimedia.org/T282761) [10:52:41] 10SRE: Updated java security policy in OpenJDK 11.9 - https://phabricator.wikimedia.org/T266782 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This got fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/688246/ [10:52:49] (03CR) 10Marostegui: [C: 03+2] pc1010: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692581 (https://phabricator.wikimedia.org/T282761) (owner: 10Marostegui) [10:53:21] !log upgrade idp-test to OpenJDK 11.0.11 T281345 [10:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:25] T281345: Tomcat/CAS fails to start with OpenJDK 11.0.11 - https://phabricator.wikimedia.org/T281345 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1100). [11:00:04] Zabe: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] o/ [11:03:06] I don’t feel confident deploying that patch, sorry [11:07:35] (03CR) 10Hnowlan: [C: 03+1] "LGTM - all things going well aqs_next will disappear in the coming weeks but I will clean up when that time comes." [puppet] - 10https://gerrit.wikimedia.org/r/692578 (owner: 10Muehlenhoff) [11:19:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:25:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:26:00] I've only deployed once before and it's all a bit foggy so I definitely can't help either :D. Zabe, do you have examples of wikis where the override exists and one where it does not as per Timo's comment? [11:27:00] Zabe: also you might try pinging the other deployers? 
[11:28:56] (03CR) 10Muehlenhoff: [C: 03+2] Extend Cumin alias for AQS [puppet] - 10https://gerrit.wikimedia.org/r/692578 (owner: 10Muehlenhoff) [11:29:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1177', diff saved to https://phabricator.wikimedia.org/P16057 and previous config saved to /var/cache/conftool/dbconfig/20210518-112942-marostegui.json [11:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:47] usually big wikis like meta, enwiki or dewiki have content at 'MediaWiki:Robots.txt' defined and small wikis don't [11:33:40] mvolz: the test plan can be to access these via mwdebug: https://en.wikipedia.org/wiki/MediaWiki:Robots.txt https://en.wikipedia.org/w/robots.php https://test2.wikipedia.org/wiki/MediaWiki:Robots.txt https://test2.wikipedia.org/w/robots.php [11:34:12] verifying the enwiki php response contains the common stuff + wiki content txt (same as when debug is off) [11:34:31] and that test2wiki responds only with the common stuff (same as when debug is off) [11:34:49] avoid being fooled by browser cache (e.g. devtools cache off) [11:35:57] seeking reviewers for https://gerrit.wikimedia.org/r/c/operations/puppet/+/692577 [11:41:42] !log upgrading idp2001 to Java 11.0.11 [11:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:05] godog: I'll have a look in a bit [11:42:54] (03PS1) 10Muehlenhoff: Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/692586 [11:45:11] moritzm: thanks! 
appreciate it [11:46:04] (03PS1) 10Filippo Giunchedi: rsync: move quickdatacopy to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) [11:50:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:53:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:50] (03PS1) 10Filippo Giunchedi: prometheus: exclude netbox_device_statistics from job availability alerting [puppet] - 10https://gerrit.wikimedia.org/r/692589 (https://phabricator.wikimedia.org/T276749) [11:55:54] Urbanecm: are you willing to deploy that patch ^ [11:57:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/692577 (owner: 10Filippo Giunchedi) [11:57:31] Zabe: not in last three minutes of the window. also I'm not able to focus enough to lead deployment rn. Just schedule it at a different time, please. [11:57:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Repool db1177', diff saved to https://phabricator.wikimedia.org/P16058 and previous config saved to /var/cache/conftool/dbconfig/20210518-115736-root.json [11:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:42] sure, will do [11:59:07] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/692577 (owner: 10Filippo Giunchedi) [11:59:32] (03CR) 10Hashar: [C: 03+2] Branch commit for wmf/1.37.0-wmf.6 [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692464 (owner: 10TrainBranchBot) [12:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1200) [12:04:26] !log scap clean 1.37.0-wmf.1 1.37.0-wmf.3 and 1.37.0-wmf.4 # T281147 [12:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:31] T281147: 1.37.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T281147 [12:04:59] !log hashar@deploy1002 clean aborted: Pruned MediaWiki: 1.37.0-wmf.1 (duration: 01m 16s) [12:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:02] (03PS2) 10Filippo Giunchedi: rsync: move quickdatacopy to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) [12:07:20] !log hashar@deploy1002 Pruned MediaWiki: 1.37.0-wmf.3 (duration: 01m 50s) [12:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:56] !log hashar@deploy1002 Pruned MediaWiki: 1.37.0-wmf.4 (duration: 01m 28s) [12:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:33] (03CR) 10Jbond: [C: 03+2] Enable SLO for librenms [puppet] - 10https://gerrit.wikimedia.org/r/692570 (owner: 10Muehlenhoff) [12:12:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Repool db1177', diff saved to https://phabricator.wikimedia.org/P16059 and previous config saved to /var/cache/conftool/dbconfig/20210518-121240-root.json [12:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:27] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per 
article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRI [12:18:27] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page vi [12:18:27] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:20:46] this is for a host recently reimaged, nothing on fire [12:20:49] hnowlan: --^ [12:21:18] (03Merged) 10jenkins-bot: Branch commit for wmf/1.37.0-wmf.6 [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692464 (owner: 10TrainBranchBot) [12:23:07] (03Abandoned) 10Hashar: download_bower: download to GERRIT_CACHE_HOME when set [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/689859 (owner: 10Hashar) [12:23:35] (03CR) 10Muehlenhoff: "For LibreNMS more work is needed, after logging out the session continues to be active as before. I'll open a separate task." 
[puppet] - 10https://gerrit.wikimedia.org/r/692570 (owner: 10Muehlenhoff) [12:26:16] (03CR) 10Muehlenhoff: "Code looks good, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [12:26:56] (03CR) 10Filippo Giunchedi: rsync: move quickdatacopy to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [12:27:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29607/console" [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [12:27:19] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for JStephenson1980 - https://phabricator.wikimedia.org/T282521 (10JStephenson) Hi I am sorry about that, I seemed to have made several attempts! I can keep this one and eliminate the other two if that is ok. 
uid: jstep sn: JStephenson [12:27:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Repool db1177', diff saved to https://phabricator.wikimedia.org/P16061 and previous config saved to /var/cache/conftool/dbconfig/20210518-122744-root.json [12:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] (03PS3) 10Filippo Giunchedi: rsync: move quickdatacopy to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) [12:28:32] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/692586 (owner: 10Muehlenhoff) [12:29:05] (03CR) 10Filippo Giunchedi: "See PCC run, a bunch of hosts affected" [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [12:29:18] (03CR) 10Filippo Giunchedi: rsync: move quickdatacopy to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [12:34:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [12:40:28] !log krinkle@mw1002 purge-parsercache-now.php on pc1010 (spare, depooled), ref P16060, T280605, T282761 [12:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:34] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [12:40:34] T280605: Reduce parser cache retention temporarily for DiscussionTools - https://phabricator.wikimedia.org/T280605 [12:42:31] PROBLEM - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [12:42:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 
(re)pooling @ 100%: Repool db1177', diff saved to https://phabricator.wikimedia.org/P16062 and previous config saved to /var/cache/conftool/dbconfig/20210518-124247-root.json [12:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aqs1012.eqiad.wmnet with reason: new AQS node [12:43:46] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aqs1012.eqiad.wmnet with reason: new AQS node [12:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:12] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for Superset [puppet] - 10https://gerrit.wikimedia.org/r/692567 (owner: 10Muehlenhoff) [12:51:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:51:15] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:51:45] RECOVERY - AQS root url on aqs1012 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [12:51:58] (03CR) 10Muehlenhoff: "Works fine in the sense that the session correctly gets terminated, but the user experience on logout could use some improvement, the dash" [puppet] - 10https://gerrit.wikimedia.org/r/692567 (owner: 10Muehlenhoff) [12:53:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:55:02] (03PS1) 10Krinkle: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692592 (https://phabricator.wikimedia.org/T198673) [12:57:56] (03PS3) 10Krinkle: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [12:58:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/692576 (https://phabricator.wikimedia.org/T283060) (owner: 10Ayounsi) [12:58:29] (03PS2) 10Krinkle: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692592 [12:58:40] (03Abandoned) 10Krinkle: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692592 (owner: 10Krinkle) [12:58:55] (03PS4) 10Krinkle: [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [12:59:01] (03CR) 10Krinkle: [C: 03+1] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [12:59:32] Krinkle: are you going to deploy that too at some point or should I? 
[12:59:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1172', diff saved to https://phabricator.wikimedia.org/P16063 and previous config saved to /var/cache/conftool/dbconfig/20210518-125945-marostegui.json [12:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] hashar and dancy: (Dis)respected human, time to deploy MediaWiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1300). Please do the needful. [13:01:54] (03PS1) 10Hashar: testwikis wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692593 [13:01:56] right in time [13:01:56] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692593 (owner: 10Hashar) [13:02:09] though jouncebot is 2 minutes in advance (or maybe that is my computer) [13:02:38] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692593 (owner: 10Hashar) [13:02:39] hashar: jouncebot announced at 13:00:04 UTC according to my computer [13:02:41] !log hashar@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.6 [13:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:46] hashar: it fired at 13:00:05 according to my computer [13:03:05] so it is my computer damn [13:03:08] thank you vgutierrez ! 
[13:03:26] I blame my DSL for that extra second of lag compared to Majavah [13:04:00] [14:57:52] jouncebot: :-\ [13:04:26] now I have to figure out whether systemd replaced good old ntpdate and how to check the time sync on my debian :D [13:04:44] 10Puppet, 10GitLab (Initialization): Puppitise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10jbond) p:05Triage→03Medium [13:05:45] 10Puppet, 10GitLab (Initialization): Puppitise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10jbond) a:05Sergey.Trofimovsky.SF→03jbond [13:08:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/692579 (https://phabricator.wikimedia.org/T282836) (owner: 10Elukey) [13:08:41] mystery solved, I had no ntp client installed. Thank you vgutierrez and Majavah ! [13:08:59] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for Hue [puppet] - 10https://gerrit.wikimedia.org/r/692545 (owner: 10Muehlenhoff) [13:14:44] (03CR) 10Muehlenhoff: "The logout works fine per se, but the UI feedback could be better, when accessing a new menu element, it e.g. 
only prints "Unknown error o" [puppet] - 10https://gerrit.wikimedia.org/r/692545 (owner: 10Muehlenhoff) [13:18:13] (03PS1) 10Hashar: Merge 'upstream/stable-3.2' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/692598 [13:20:21] (03CR) 10jerkins-bot: [V: 04-1] Merge 'upstream/stable-3.2' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/692598 (owner: 10Hashar) [13:20:47] (03PS2) 10Urbanecm: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692452 (https://phabricator.wikimedia.org/T280400) [13:21:43] (03PS5) 10Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 [13:21:54] (03PS9) 10Hashar: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684411 [13:22:04] (03CR) 10Elukey: [C: 03+2] admin: add user oljad account and groups [puppet] - 10https://gerrit.wikimedia.org/r/692579 (https://phabricator.wikimedia.org/T282836) (owner: 10Elukey) [13:23:11] (03CR) 10jerkins-bot: [V: 04-1] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 (owner: 10Hashar) [13:24:48] (03CR) 10Cathal Mooney: [C: 03+1] SNMP: filter out default logical interfaces (.0) [homer/public] - 10https://gerrit.wikimedia.org/r/692576 (https://phabricator.wikimedia.org/T283060) (owner: 10Ayounsi) [13:26:08] (03CR) 10Clarakosi: "This change is ready for review." 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) (owner: 10Clarakosi) [13:26:39] (03CR) 10Clarakosi: [C: 04-2] "-2 until its been tested" [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) (owner: 10Clarakosi) [13:27:02] 10SRE, 10CAS-SSO: Tomcat/CAS fails to start with OpenJDK 11.0.11 - https://phabricator.wikimedia.org/T281345 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is resolved (and OpenJDK updated to 11.0.11) [13:27:29] (03CR) 10Ppchelko: api-gateway: Implement new ratelimit configurations from envoy 1.16 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) (owner: 10Clarakosi) [13:28:00] (03CR) 10Jbond: wmflib::role_hosts: new function return list of hosts running a role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [13:29:57] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users, Kerberos, and LDAP for ODimitrijevic - https://phabricator.wikimedia.org/T282836 (10elukey) ` elukey@krb1001:~$ sudo manage_principals.py create oljad --email_address=odimitrijevic@wi... 
[13:34:44] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for yarn [puppet] - 10https://gerrit.wikimedia.org/r/692542 (owner: 10Muehlenhoff) [13:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Repool db1172', diff saved to https://phabricator.wikimedia.org/P16064 and previous config saved to /var/cache/conftool/dbconfig/20210518-133531-root.json [13:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] (03CR) 10Volans: wmflib::role_hosts: new function return list of hosts running a role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [13:40:46] (03PS1) 10Muehlenhoff: Enable SLO for grafana [puppet] - 10https://gerrit.wikimedia.org/r/692605 [13:42:20] (03CR) 10Ladsgroup: "One nitpick and it's good to go. I don't know the details of rsync but I'm tiny bit worried if we don't disable logging, it'll explode the" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [13:47:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692605 (owner: 10Muehlenhoff) [13:47:54] (03PS1) 10Muehlenhoff: Enable SLO by default [puppet] - 10https://gerrit.wikimedia.org/r/692607 [13:47:56] (03CR) 10Clarakosi: [C: 04-2] api-gateway: Implement new ratelimit configurations from envoy 1.16 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) (owner: 10Clarakosi) [13:48:47] 10Puppet, 10GitLab (Initialization): Puppitise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10jbond) this is what we get from those files ` kernel.sem = 250 32000 32 262 kernel.shmall = 4194304 kernel.shmmax = 17179869184 net.core.somaxconn = 1024 ` [13:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Repool db1172', diff saved to 
https://phabricator.wikimedia.org/P16065 and previous config saved to /var/cache/conftool/dbconfig/20210518-135034-root.json [13:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:51:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/692607 (owner: 10Muehlenhoff) [13:53:18] (03CR) 10Filippo Giunchedi: [C: 03+1] wmflib::role_hosts: new function return list of hosts running a role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692286 (https://phabricator.wikimedia.org/T282880) (owner: 10Jbond) [13:53:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:02:35] (03PS1) 10Jbond: gitlab: manage sysctl files [puppet] - 10https://gerrit.wikimedia.org/r/692609 (https://phabricator.wikimedia.org/T283076) [14:03:37] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 200925528 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:04:47] (03CR) 10Jbond: [C: 03+2] gitlab: manage sysctl files [puppet] - 10https://gerrit.wikimedia.org/r/692609 (https://phabricator.wikimedia.org/T283076) (owner: 10Jbond) [14:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Repool db1172', diff saved to https://phabricator.wikimedia.org/P16066 and previous config saved to /var/cache/conftool/dbconfig/20210518-140538-root.json [14:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:09] RECOVERY - 
Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 251008 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:06:15] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Verified+1 Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/691231 (owner: 10Filippo Giunchedi) [14:09:24] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [14:14:43] (03CR) 10Jcrespo: [C: 03+1] "This is ok to me, but manuel to decide if to go for 10.4 or 10.5 in the end." [puppet] - 10https://gerrit.wikimedia.org/r/686393 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [14:14:49] hashar: dancy hi! mforns and I would like to deploy a mw config patch. is the train in progress or are we clear? [14:15:06] ottomata: it is in progress still :/ [14:15:13] rebuilding cdb files [14:15:16] (03PS1) 10Mforns: Migrate VirtualPageView to EventPlatform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692611 (https://phabricator.wikimedia.org/T238138) [14:15:22] ok ty, lemme know when clear! :) good luck! 
[14:15:50] oh it is merely running in the background and I just press ENTER from time to time to speed it up :D [14:16:28] right, but you have to know the crucial moments when to press ENTER [14:16:33] it's a high stress high skill job [14:17:53] !log installing remaining postgresql-11 updates (client tools and libs, servers already done) [14:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Repool db1172', diff saved to https://phabricator.wikimedia.org/P16067 and previous config saved to /var/cache/conftool/dbconfig/20210518-142042-root.json [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:36] ottomata: more or less. There is a bug somewhere in the stack that stalls the progress and pressing ENTER unlocks it :-\ [14:21:48] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.6 (duration: 79m 07s) [14:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] checking stuff [14:22:22] hashar: sounds like great job security, we'll always need an expert in hitting enter :) [14:22:44] ahahah [14:23:06] (03CR) 10Filippo Giunchedi: [C: 03+2] "Going ahead per task" [puppet] - 10https://gerrit.wikimedia.org/r/692589 (https://phabricator.wikimedia.org/T276749) (owner: 10Filippo Giunchedi) [14:23:11] https://phabricator.wikimedia.org/T223287 which literally states "press ENTER repeatedly" :-\ [14:23:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:27] ottomata: it is all yours :] [14:24:42] seems the basics are working and there are no errors showing up so far [14:25:15] ottomata: OH NO sorry [14:25:19] gotta promote group 0 :\ [14:25:28] (03PS7) 10Giuseppe Lavagetto: Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 [14:25:30] (03PS3) 10Giuseppe Lavagetto: Rakefile: split more of it into submodules [deployment-charts] - 10https://gerrit.wikimedia.org/r/688265 [14:25:53] (03PS1) 10Hashar: group0 wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692613 [14:25:55] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692613 (owner: 10Hashar) [14:26:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1083.eqiad.wmnet - https://phabricator.wikimedia.org/T281445 (10wiki_willy) a:05wiki_willy→03Cmjohnson [14:26:23] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission labsdb1009.eqiad.wmnet - https://phabricator.wikimedia.org/T282522 (10wiki_willy) a:05wiki_willy→03Cmjohnson [14:26:36] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692613 (owner: 10Hashar) [14:27:15] k good luck with your ENTER on that one [14:27:24] :) [14:28:24] (03PS1) 10Jbond: gitlab: disable grafana, node-exporter, promethous and alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/692614 (https://phabricator.wikimedia.org/T283076) [14:29:24] (03CR) 10Jbond: [C: 03+2] gitlab: disable grafana, node-exporter, promethous and alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/692614 (https://phabricator.wikimedia.org/T283076) (owner: 10Jbond) [14:31:14] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 
10Filippo Giunchedi) [14:32:09] !log kormat@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1085.eqiad.wmnet [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:56] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.6 [14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/692605 (owner: 10Muehlenhoff) [14:35:06] ottomata: looks good now :] Please deploy! [14:35:38] (03CR) 10Ladsgroup: [C: 03+1] rsync: move quickdatacopy to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [14:36:44] ok thank you! [14:37:04] (03CR) 10Ottomata: [C: 03+2] Migrate VirtualPageView to EventPlatform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692611 (https://phabricator.wikimedia.org/T238138) (owner: 10Mforns) [14:37:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 (owner: 10Giuseppe Lavagetto) [14:38:55] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate VirtualPageView to EventPlatform on all wikis - T238138 (duration: 01m 06s) [14:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:00] T238138: VirtualPageView Event Platform Migration - https://phabricator.wikimedia.org/T238138 [14:40:16] (03Merged) 10jenkins-bot: Add diff tasks to rake [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 (owner: 10Giuseppe Lavagetto) [14:43:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] eventgate: add kafka egress policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:43:31] (03PS8) 10Giuseppe Lavagetto: eventgate: add kafka egress 
policy stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) [14:44:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: split more of it into submodules [deployment-charts] - 10https://gerrit.wikimedia.org/r/688265 (owner: 10Giuseppe Lavagetto) [14:45:05] (03CR) 10Ottomata: [C: 03+1] "Cool!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/684855 (https://phabricator.wikimedia.org/T253058) (owner: 10Giuseppe Lavagetto) [14:45:22] (03CR) 10Muehlenhoff: [C: 03+2] mariadb: Install 10.5 client on host with only client packages [puppet] - 10https://gerrit.wikimedia.org/r/686393 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [14:46:26] (03Merged) 10jenkins-bot: Rakefile: split more of it into submodules [deployment-charts] - 10https://gerrit.wikimedia.org/r/688265 (owner: 10Giuseppe Lavagetto) [14:48:05] log spam looks ok. I will have a break now [14:51:23] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1085.eqiad.wmnet [14:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:19] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/673097" [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [14:58:45] (03Abandoned) 10Dzahn: rsync::quickdatacopy: replace cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/673097 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:59:00] (03PS1) 10Kormat: db1085: Remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/692618 (https://phabricator.wikimedia.org/T282096) [14:59:06] (03CR) 10Dzahn: "abendoned that one because the reviews already happened here now" [puppet] - 10https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [15:02:50] (03CR) 10Cwhite: [V: 03+1 C: 03+1] "> Patch Set 
1:" [puppet] - 10https://gerrit.wikimedia.org/r/691231 (owner: 10Filippo Giunchedi) [15:08:44] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable SLO for grafana [puppet] - 10https://gerrit.wikimedia.org/r/692605 (owner: 10Muehlenhoff) [15:09:33] (03PS1) 10Muehlenhoff: Add grant for cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/692621 (https://phabricator.wikimedia.org/T276589) [15:10:08] 10SRE: Decom failoid1001/failoid2001 - https://phabricator.wikimedia.org/T282405 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete. [15:10:37] (03CR) 10Jbond: Add diff tasks to rake (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/685721 (owner: 10Giuseppe Lavagetto) [15:14:17] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 64228264 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:16:49] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7792 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:18:06] 10SRE, 10Analytics-Clusters, 10Analytics-Kanban: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Ottomata) Can we add defaults for the profile::java parameters? I see some duplicate values copy and pasted in quite a few role yamls already. 
[15:18:15] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/692618 (https://phabricator.wikimedia.org/T282096) (owner: 10Kormat) [15:18:44] (03CR) 10Kormat: [C: 03+2] db1085: Remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/692618 (https://phabricator.wikimedia.org/T282096) (owner: 10Kormat) [15:20:24] jouncebot: now [15:20:24] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [15:21:45] 10SRE, 10Analytics-Clusters, 10Analytics-Kanban: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Ottomata) Oh, you put the defaults in the common/profile/java.yaml class parameter hiera? Huh. I had thought that wasn't allowed: https://wikitech.wikimedia.org/wiki/P... [15:22:31] 10SRE, 10Analytics-Clusters, 10Analytics-Kanban: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10MoritzMuehlenhoff) >>! In T282454#7095851, @Ottomata wrote: > Can we add defaults for the profile::java parameters? I see some duplicate values copy and pasted in quit... [15:23:11] !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 07s) [15:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:56] (03PS1) 10Ssingh: Add zone for wikimedia-dns.org (Wikidough) [dns] - 10https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) [15:28:58] 10ops-eqiad, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Kormat) Hey DC ops, this machine is now ready for your tender mercies. Thanks! 
[15:29:03] 10ops-eqiad, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Kormat) [15:29:06] 10ops-eqiad, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10Kormat) a:05Kormat→03None [15:30:29] !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 05s) [15:30:30] (03CR) 10Volans: "Question inline" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:54] (03Abandoned) 10Cwhite: remove alerting_host role from icinga[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/677656 (https://phabricator.wikimedia.org/T247966) (owner: 10Cwhite) [15:33:58] (03CR) 10Kormat: [C: 04-1] Add grant for cumin2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692621 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [15:34:46] (03CR) 10Ssingh: "> Patch Set 1:" (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/692625 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:36:50] (03PS1) 10Ottomata: Hadoop - set hardened_tls: true and remove java::securty from hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) [15:36:55] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:37:28] (03CR) 10Muehlenhoff: Add grant for cumin2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692621 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [15:38:04] topranks: any chance that's you re: "Uncommitted dbctl configuration changes" on cumin2001? 
[15:38:13] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:38:19] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:38:28] topranks: disregard [15:38:42] cdanis: hii [15:38:54] do i need to do a manual dbctl config commit to remove the host? [15:39:02] kormat: ok no worries, eh I don't think it was me but maybe I did something wrong, wasn't doing anything recently there [15:39:21] topranks: i should have looked at the diff first, rather than at who was recently active :) [15:39:42] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29608/console" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [15:40:18] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat Fanciness is only partial? https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:40:18] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat Fanciness is only partial? https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:43:00] (03CR) 10Ottomata: [V: 03+1] "Great, a no-op!" 
[puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [15:43:08] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) PR is at https://github.com/projectcalico/confd/pull/515, waiting for review now. It's been tested locally in a coup... [15:46:23] (03PS1) 10Giuseppe Lavagetto: New golang version in the standard image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692629 [15:47:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/692607 (owner: 10Muehlenhoff) [15:47:34] (03CR) 10Kormat: [C: 04-1] Add grant for cumin2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692621 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [15:49:42] (03CR) 10Ottomata: [V: 03+1] "Actually, this is not a no-op, its only a no-op on the hadoop workers and masters. Investigating." 
[puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [15:51:47] (03CR) 10Elukey: [C: 03+1] "I assume that it builds locally etc.., LGTM :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692629 (owner: 10Giuseppe Lavagetto) [15:53:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] New golang version in the standard image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692629 (owner: 10Giuseppe Lavagetto) [15:54:03] (03CR) 10Bstorm: "To make review and discussion easier, I've rendered the templates this produces locally (without giving it a release name and understandin" [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [15:54:48] (03PS1) 10Jbond: system::role: add ability to specify role owners [puppet] - 10https://gerrit.wikimedia.org/r/692632 [15:56:52] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 (10wiki_willy) a:03Cmjohnson [15:57:09] (03CR) 10Bstorm: "That can be compared to values in modules/toolforge/templates/k8s/nginx-ingress.yaml.erb" [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [15:57:43] (03PS10) 10Hashar: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684411 [15:57:46] (03PS6) 10Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 [15:59:45] (03PS2) 10Ottomata: Hadoop - set hardened_tls: true and remove java::securty from hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) [16:00:05] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1600). [16:00:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1085 being decommissioned T282096', diff saved to https://phabricator.wikimedia.org/P16073 and previous config saved to /var/cache/conftool/dbconfig/20210518-160053-kormat.json [16:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:58] T282096: decommission db1085.eqiad.wmnet - https://phabricator.wikimedia.org/T282096 [16:02:39] (03PS3) 10Ottomata: Hadoop - set hardened_tls: true and remove java::securty from hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) [16:03:25] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:04:12] (03CR) 10Elukey: "Andrew I see the new option duplicated in some files with different comments, is it on purpose?" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [16:04:29] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:04:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:07:44] (03PS1) 10Herron: install_server: add buster-rescue tftp config [puppet] - 10https://gerrit.wikimedia.org/r/692634 (https://phabricator.wikimedia.org/T282575) [16:07:59] (03CR) 10Ottomata: "Nope, fixed, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [16:08:41] (03CR) 10Andrew Bogott: [C: 03+2] Trove: open up a lot of read-only policies [puppet] - 10https://gerrit.wikimedia.org/r/692354 (https://phabricator.wikimedia.org/T282809) (owner: 10Andrew Bogott) [16:10:12] (03PS4) 10Ottomata: Hadoop - set hardened_tls: true and remove java::securty from hadoop::common [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) [16:10:46] (03PS1) 10Jbond: P:role_data: create a new profile to call system::role [puppet] - 10https://gerrit.wikimedia.org/r/692635 [16:10:48] (03PS1) 10Jbond: (test) migrate sretest to new role_data profile [puppet] - 10https://gerrit.wikimedia.org/r/692636 [16:11:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29610/console" [puppet] - 10https://gerrit.wikimedia.org/r/692636 (owner: 10Jbond) [16:11:50] (03CR) 10jerkins-bot: [V: 04-1] (test) migrate sretest to new role_data profile [puppet] - 10https://gerrit.wikimedia.org/r/692636 (owner: 10Jbond) [16:12:43] (03PS2) 10Jbond: system::role: add ability to specify role owners [puppet] - 10https://gerrit.wikimedia.org/r/692632 [16:12:54] (03PS2) 10Jbond: P:role_data: create a new profile to call system::role [puppet] - 10https://gerrit.wikimedia.org/r/692635 [16:13:06] (03PS2) 10Jbond: (test) migrate sretest to new role_data profile [puppet] - 10https://gerrit.wikimedia.org/r/692636 [16:13:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29611/console" [puppet] - 10https://gerrit.wikimedia.org/r/692636 (owner: 10Jbond) [16:16:02] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, and 3 others: New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10jijiki) [16:16:18] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS 
(NOOP 3 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29612/console" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [16:17:17] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, and 2 others: New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10jijiki) [16:21:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:09] (03CR) 10Ottomata: [V: 03+1] "Thar we go, that's a no-op" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [16:22:18] (03PS1) 10Andrew Bogott: wmcs-policy-tests: don't hardcode trove datastore and version [puppet] - 10https://gerrit.wikimedia.org/r/692639 (https://phabricator.wikimedia.org/T279845) [16:22:20] (03PS1) 10Andrew Bogott: Trove: fix a regexp misfire in the policy file [puppet] - 10https://gerrit.wikimedia.org/r/692640 [16:23:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-policy-tests: don't hardcode trove datastore and version [puppet] - 10https://gerrit.wikimedia.org/r/692639 (https://phabricator.wikimedia.org/T279845) (owner: 10Andrew Bogott) [16:23:33] (03CR) 10Andrew Bogott: [C: 03+2] Trove: fix a regexp misfire in the policy file [puppet] - 10https://gerrit.wikimedia.org/r/692640 (owner: 10Andrew Bogott) [16:23:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:25:06] (03CR) 10Elukey: "The profile is used in a lot of places:" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [16:29:33] (03CR) 10Jforrester: [C: 03+1] [Beta Cluster] Remove deploymentwiki site configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [16:41:00] (03PS1) 10Jcrespo: Revert "bacula: Reenable read-write ES database backups, disable read-only" [puppet] - 10https://gerrit.wikimedia.org/r/692650 [16:41:13] (03PS1) 10RLazarus: admin: Add a schema for ldap_only_users. [puppet] - 10https://gerrit.wikimedia.org/r/692643 [16:44:10] (03CR) 10Jcrespo: [C: 04-1] "Waiting for read-write backups to finish." [puppet] - 10https://gerrit.wikimedia.org/r/692650 (owner: 10Jcrespo) [17:00:05] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1700). 
[17:21:30] 10SRE, 10MW-on-K8s, 10serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10jijiki) [17:30:34] (03PS1) 10Effie Mouzeli: Add tokens and users for mwdebug service [puppet] - 10https://gerrit.wikimedia.org/r/692667 (https://phabricator.wikimedia.org/T283056) [17:32:40] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, and 2 others: New Service Request maps-vector-server - https://phabricator.wikimedia.org/T274390 (10jijiki) [17:34:34] (03CR) 10Legoktm: [C: 03+2] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [17:34:49] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [17:35:30] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset/Turnilo for Kgordon - https://phabricator.wikimedia.org/T283057 (10Maryana) Just commenting to confirm this request for Kinneret to be added to the contractors' access group for Superset/Turnilo! LMK if you need anything else – thanks! 
[17:42:02] (03PS1) 10Effie Mouzeli: (WIP) Add tokens and users for mwdebug service [puppet] - 10https://gerrit.wikimedia.org/r/692669 [17:44:33] (03PS1) 10Zabe: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) [17:44:52] (03PS1) 10Herron: initial file import [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/692671 [17:46:03] (03CR) 10jerkins-bot: [V: 04-1] Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) (owner: 10Zabe) [17:46:38] (03PS2) 10Zabe: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) [17:47:11] (03PS2) 10Effie Mouzeli: (WIP) Add tokens and users for maps-vector-server [puppet] - 10https://gerrit.wikimedia.org/r/692669 [17:48:44] (03PS1) 10Effie Mouzeli: Add tokens and users for mwdebug [labs/private] - 10https://gerrit.wikimedia.org/r/692672 (https://phabricator.wikimedia.org/T283056) [17:49:05] (03CR) 10Herron: "This change is ready for review." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/692671 (owner: 10Herron) [17:49:15] (03CR) 10Herron: [V: 03+2 C: 03+2] initial file import [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/692671 (owner: 10Herron) [17:49:24] (03PS3) 10Zabe: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) [17:50:32] 10SRE, 10serviceops, 10Developer Productivity, 10Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (10thcipriani) [17:51:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:53:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:59:00] (03PS4) 10Zabe: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) [18:00:05] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1800). [18:00:05] Zabe: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:11] o/ [18:01:27] i can deploy today [18:02:18] (03CR) 10Urbanecm: [C: 03+2] robots.php: avoid using ContentHandler::getContentText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692413 (owner: 10Zabe) [18:03:03] (03Merged) 10jenkins-bot: robots.php: avoid using ContentHandler::getContentText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692413 (owner: 10Zabe) [18:06:36] Zabe: pulled onto mwdebug1001 [18:06:44] I'm also going to test this one myself [18:07:24] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 pending moderator request inaccessible after list migration (wikimedia-us-chi) - https://phabricator.wikimedia.org/T283095 (10Airplaneman) Yep, the link goes to https://lists.wikimedia.org/mailman/admindb/... Thanks for pointing me to {T282310}. Looks like it's the sa... [18:08:35] ha, i'm getting [4a1b01fa-1a3c-48dd-b25a-e5e3a4201c2a] 2021-05-18 18:08:16: Fatal exception of type "Error" [18:08:42] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 pending moderator request inaccessible after list migration (wikimedia-us-chi) - https://phabricator.wikimedia.org/T283095 (10Airplaneman) 05Open→03Resolved a:03Airplaneman [18:08:46] that's not what should happen [18:09:03] confirmed, reverting [18:09:40] (03PS1) 10Urbanecm: Revert "robots.php: avoid using ContentHandler::getContentText()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692679 [18:09:42] (03CR) 10Urbanecm: [C: 03+2] Revert "robots.php: avoid using ContentHandler::getContentText()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692679 (owner: 10Urbanecm) [18:10:06] Zabe: is there a task for this fix? [18:10:24] if not, can you create it, so i can paste the error somewhere? 
[18:10:38] (03Merged) 10jenkins-bot: Revert "robots.php: avoid using ContentHandler::getContentText()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692679 (owner: 10Urbanecm) [18:10:59] * Urbanecm clears mwdebug1001 [18:11:04] There is T268041 [18:11:04] T268041: Deprecate and remove ContentHandler::getContentText() and $wgContentHandlerTextFallback - https://phabricator.wikimedia.org/T268041 [18:12:11] cool, I'll note it there [18:12:33] actually... [18:12:35] ...no i won't [18:12:38] easy fix [18:13:30] oh, I see ... [18:13:35] (03CR) 10Ottomata: [V: 03+1] "Right but, the require of java::security is wrapped in a if $ensure_ssl_config { ... } conditional, so its removal only affects nodes wher" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [18:13:36] (03PS1) 10Urbanecm: robots.php: avoid using ContentHandler::getContentText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692680 [18:13:47] Zabe: ^^please review^^ [18:14:21] (03CR) 10Zabe: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692680 (owner: 10Urbanecm) [18:14:35] Zabe: thx. Pulled to mwdebug1001, please test there [18:16:08] Urbanecm: yeah, now it's working for me. Shows the same stuff as if debug is disabled, as it should be. 
[18:17:08] noted [18:17:14] * Urbanecm finishing his test script [18:17:58] !log razzi@deploy1002 Started deploy [analytics/refinery@9392f1d]: Regular analytics weekly train [analytics/refinery@9392f1db6e66975304c8e9b2b7031acd3ed87fa7] [18:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:45] my tests passed, double checking logs [18:23:07] all looks well, merging [18:23:15] (03CR) 10Urbanecm: [C: 03+2] robots.php: avoid using ContentHandler::getContentText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692680 (owner: 10Urbanecm) [18:24:04] (03Merged) 10jenkins-bot: robots.php: avoid using ContentHandler::getContentText() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692680 (owner: 10Urbanecm) [18:24:30] and syncing [18:24:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RobH) Unisys tech Scott Jones will be going onsite tomorrow via ticket 1962822 (cyrusone escort) to swap out the SSD and return the de... 
[18:26:08] !log urbanecm@deploy1002 Synchronized w/robots.php: 8224e53f6da61bf037bb3e3ad1cf367bf9b5a588: robots.php: avoid using ContentHandler::getContentText() (T268041) (duration: 01m 04s) [18:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:12] T268041: Deprecate and remove ContentHandler::getContentText() and $wgContentHandlerTextFallback - https://phabricator.wikimedia.org/T268041 [18:26:16] and...done Zabe [18:26:26] (03PS1) 10Ottomata: Remove profile::analytics::cluster::packages::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/692682 (https://phabricator.wikimedia.org/T275786) [18:26:31] (03CR) 10Ottomata: [V: 03+1] "ok, reran with more nodes:" [puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [18:26:57] (03CR) 10Urbanecm: [C: 03+2] Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) (owner: 10Zabe) [18:27:34] Zabe: can you please add the second robots.txt patch (the one with a fix) to calendar, too? 
[18:27:49] (03Merged) 10jenkins-bot: Update IP addresses for Wiki Education Dashboard exemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692670 (https://phabricator.wikimedia.org/T283096) (owner: 10Zabe) [18:27:50] yes [18:27:55] thank you [18:27:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1074.eqiad.wmnet - https://phabricator.wikimedia.org/T281959 (10wiki_willy) a:05Marostegui→03Cmjohnson [18:28:20] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29614/console" [puppet] - 10https://gerrit.wikimedia.org/r/692682 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [18:28:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RobH) p:05Medium→03High [18:29:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:29:05] (03PS2) 10Ottomata: Remove profile::analytics::cluster::packages::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/692682 (https://phabricator.wikimedia.org/T275786) [18:29:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3da5a8bc93c734e93c3012dace49ee1b881927a8: Update IP addresses for Wiki Education Dashboard exemptions (T283096) (duration: 01m 06s) [18:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:39] T283096: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block - https://phabricator.wikimedia.org/T283096 [18:29:43] Zabe: synced the other one [18:29:54] thanks [18:30:36] anything else Zabe ? 
[18:31:01] no [18:33:37] !log razzi@deploy1002 Finished deploy [analytics/refinery@9392f1d]: Regular analytics weekly train [analytics/refinery@9392f1db6e66975304c8e9b2b7031acd3ed87fa7] (duration: 15m 39s) [18:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:34:55] !log razzi@deploy1002 Started deploy [analytics/refinery@9392f1d] (thin): Regular analytics weekly train THIN [analytics/refinery@9392f1db6e66975304c8e9b2b7031acd3ed87fa7] [18:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:02] !log razzi@deploy1002 Finished deploy [analytics/refinery@9392f1d] (thin): Regular analytics weekly train THIN [analytics/refinery@9392f1db6e66975304c8e9b2b7031acd3ed87fa7] (duration: 00m 07s) [18:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:39] !log razzi@deploy1002 Started deploy [analytics/refinery@9392f1d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9392f1db6e66975304c8e9b2b7031acd3ed87fa7] [18:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:23] (03CR) 10Ottomata: [C: 03+2] Remove profile::analytics::cluster::packages::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/692682 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [18:40:55] !log razzi@deploy1002 Finished deploy [analytics/refinery@9392f1d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9392f1db6e66975304c8e9b2b7031acd3ed87fa7] (duration: 05m 16s) [18:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:29] (03PS1) 10Andrew Bogott: Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 
10https://gerrit.wikimedia.org/r/692686 [18:48:36] (03CR) 10jerkins-bot: [V: 04-1] Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 10https://gerrit.wikimedia.org/r/692686 (owner: 10Andrew Bogott) [18:48:55] (03PS2) 10Andrew Bogott: Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 10https://gerrit.wikimedia.org/r/692686 [18:48:57] (03PS1) 10Andrew Bogott: wmcs-policy-tests.py: add a few tests for public Trove APIs [puppet] - 10https://gerrit.wikimedia.org/r/692687 (https://phabricator.wikimedia.org/T279845) [18:50:14] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-policy-tests.py: add a few tests for public Trove APIs [puppet] - 10https://gerrit.wikimedia.org/r/692687 (https://phabricator.wikimedia.org/T279845) (owner: 10Andrew Bogott) [18:50:28] (03CR) 10jerkins-bot: [V: 04-1] Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 10https://gerrit.wikimedia.org/r/692686 (owner: 10Andrew Bogott) [18:53:22] (03PS3) 10Andrew Bogott: Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 10https://gerrit.wikimedia.org/r/692686 [18:53:41] (03PS1) 10Volans: setup.py: upgrade dependencies [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692690 [18:53:43] (03PS1) 10Volans: static: update CSS and JS libraries [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692691 [18:53:45] (03PS1) 10Volans: config: improve CSP headers [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692692 [18:54:59] (03CR) 10Volans: "This, along with the other patches in the series can be seen live on https://sso-debmon.wmcloud.org/" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692690 (owner: 10Volans) [18:55:14] (03CR) 10Volans: "This, along with the other patches in the series can be seen live on https://sso-debmon.wmcloud.org/" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692691 (owner: 10Volans) [18:55:22] (03CR) 10Volans: "This, along with the other patches in the series 
can be seen live on https://sso-debmon.wmcloud.org/" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692692 (owner: 10Volans) [18:58:15] (03PS4) 10Andrew Bogott: Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 10https://gerrit.wikimedia.org/r/692686 [18:58:19] 10SRE, 10Traffic, 10Patch-For-Review: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (10ssingh) [19:00:05] hashar and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T1900). [19:02:18] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 pending moderator request inaccessible after list migration (wikimedia-us-chi) - https://phabricator.wikimedia.org/T283095 (10Legoktm) [19:02:29] 10SRE, 10Wikimedia-Mailing-lists: Old pending actions in migrated ML were not imported - https://phabricator.wikimedia.org/T282310 (10Legoktm) [19:21:49] (03PS1) 10Kosta Harlan: Add a link: Set contentedtiable=false on mobile [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692653 (https://phabricator.wikimedia.org/T281771) [19:25:31] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: provide a hiera flag to adopt new policy enforcement [puppet] - 10https://gerrit.wikimedia.org/r/692686 (owner: 10Andrew Bogott) [19:26:22] (03PS1) 10Clarakosi: DNM: Changes to use envoyproxy's image of envoy 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/692695 [19:27:01] (03CR) 10Clarakosi: [C: 04-2] "For testing purposes. Not to be merged." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/692695 (owner: 10Clarakosi) [19:32:40] (03PS2) 10Clarakosi: api-gateway: Implement new ratelimit configurations from envoy 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) [19:41:03] (03CR) 10Andrew Bogott: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:47:10] (03CR) 10Cwhite: [C: 03+2] logstash: remove logstash-output-statsd plugin [puppet] - 10https://gerrit.wikimedia.org/r/692337 (owner: 10Cwhite) [19:49:22] (03PS3) 10Clarakosi: api-gateway: Implement new ratelimit configurations from envoy 1.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) [19:50:40] (03CR) 10Clarakosi: [C: 04-2] api-gateway: Implement new ratelimit configurations from envoy 1.16 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692404 (https://phabricator.wikimedia.org/T260591) (owner: 10Clarakosi) [20:05:47] (03CR) 10Cwhite: [C: 03+2] logstash: add nodejs ecs migration config and tests [puppet] - 10https://gerrit.wikimedia.org/r/690759 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [20:12:20] (03PS1) 10Cwhite: logstash: set partition metadata for node transition [puppet] - 10https://gerrit.wikimedia.org/r/692701 [20:15:20] (03CR) 10Cwhite: [C: 03+2] logstash: set partition metadata for node transition [puppet] - 10https://gerrit.wikimedia.org/r/692701 (owner: 10Cwhite) [20:18:08] (03PS1) 10Andrew Bogott: codfw1dev: don't apply new scoped and default keystone roles yet [puppet] - 10https://gerrit.wikimedia.org/r/692702 [20:18:21] 10SRE, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation, 10Release-Engineering-Team (Seen): Document helm chart creation - https://phabricator.wikimedia.org/T213197 (10thcipriani) 05Open→03Resolved a:03jeena @jeena documented this on: 
https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Mi... [20:18:24] 10SRE, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation, 10Release-Engineering-Team (Pipeline): TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10thcipriani) [20:18:57] (03PS1) 10Volans: admin: add additional validation tests [puppet] - 10https://gerrit.wikimedia.org/r/692703 [20:19:07] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: don't apply new scoped and default keystone roles yet [puppet] - 10https://gerrit.wikimedia.org/r/692702 (owner: 10Andrew Bogott) [20:19:24] 10SRE, 10Platform Engineering, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani) [20:20:10] (03CR) 10Volans: "This is a follow up from the cross-validate-accounts email with a stacktrace for an invalid date in the data.yaml." [puppet] - 10https://gerrit.wikimedia.org/r/692703 (owner: 10Volans) [20:35:00] (03PS1) 10Legoktm: lists: Redirect all admin/ URLs to postorius [puppet] - 10https://gerrit.wikimedia.org/r/692707 [20:38:38] (03PS1) 10Andrew Bogott: Trove logs -> ELK [puppet] - 10https://gerrit.wikimedia.org/r/692708 [20:39:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:41:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:43:08] (03CR) 10Cwhite: [C: 03+1] Trove logs -> ELK [puppet] - 10https://gerrit.wikimedia.org/r/692708 (owner: 10Andrew Bogott) [20:43:32] (03CR) 10Andrew Bogott: [C: 03+2] Trove logs -> ELK [puppet] - 10https://gerrit.wikimedia.org/r/692708 (owner: 10Andrew Bogott) [20:47:07] (03PS1) 10Ppchelko: Use envoy 1.16 nested json feature for access logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/692709 (https://phabricator.wikimedia.org/T260820) [20:51:27] (03PS1) 10Clarakosi: api-gateway: Add default_value to dynamic_metadata if JWT is not set [deployment-charts] - 10https://gerrit.wikimedia.org/r/692714 (https://phabricator.wikimedia.org/T261350) [20:59:07] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) Since this ticket has been moved to "Done" on a workboard, should it also be closed as resolved now? 
[21:00:36] (03PS1) 10Ppchelko: api-gateway: Make use of host_rewrite_path_regex option [deployment-charts] - 10https://gerrit.wikimedia.org/r/692717 [21:07:09] (03PS1) 10Zabe: Fix call to non-existent var [core] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/692654 (https://phabricator.wikimedia.org/T283098) [21:07:29] (03PS1) 10Zabe: Fix call to non-existent var [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692655 (https://phabricator.wikimedia.org/T283098) [21:07:53] (03PS1) 10Cwhite: logstash: normalize_level to parse bunyan level integers [puppet] - 10https://gerrit.wikimedia.org/r/692719 [21:13:22] (03PS1) 10Ahmon Dancy: update train-versions.json (1) [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692720 [21:13:32] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) [21:13:54] (03PS1) 10Ahmon Dancy: update train-versions.json (2) [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692721 [21:14:28] (03CR) 10Ahmon Dancy: [C: 03+2] update train-versions.json (1) [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692720 (owner: 10Ahmon Dancy) [21:14:38] (03CR) 10Ahmon Dancy: [C: 03+2] update train-versions.json (2) [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692721 (owner: 10Ahmon Dancy) [21:15:12] (03Merged) 10jenkins-bot: update train-versions.json (1) [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692720 (owner: 10Ahmon Dancy) [21:15:14] (03CR) 10jerkins-bot: [V: 04-1] update train-versions.json (2) [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692721 (owner: 10Ahmon Dancy) [21:15:16] (03CR) 10Cwhite: [C: 03+2] logstash: normalize_level to parse bunyan level integers [puppet] - 10https://gerrit.wikimedia.org/r/692719 (owner: 10Cwhite) [21:15:25] 10SRE, 10Services, 10Patch-For-Review, 
10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) p:05Medium→03High [21:16:53] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Jclark-ctr) Reset default configurations is showing up at this time [21:20:27] (03PS1) 10Krinkle: Simplify mc.php (1/7): Fix load order in Beta to match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692722 [21:20:28] (03PS1) 10Krinkle: Simplify mc.php (2/7): Remove mc-labs.php overrides that match prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692723 [21:20:31] (03PS1) 10Krinkle: Simplify mc.php (3/7): Remove $wgMemCachedTimeout setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692724 [21:20:33] (03PS1) 10Krinkle: Simplify mc.php (4/7): Move wgMemCachedServers to mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692725 [21:20:34] (03PS1) 10Krinkle: Simplify mc.php (5/7): Move wgMemCachedServers to mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692726 [21:20:35] effie: ^ FYI :) [21:20:37] (03PS1) 10Krinkle: Simplify mc.php (6/7): Move mc.php and db.php inclusion a few lines up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692727 [21:20:39] (03PS1) 10Krinkle: Simplify mc.php (7/7): Define 'wancache-main-mcrouter' unconditionally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692728 [21:20:41] (03PS1) 10Krinkle: [Beta Cluster] mc-labs.php: Enable onHostRoutingPrefix for WAN cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692729 (https://phabricator.wikimedia.org/T264604) [21:23:17] (03PS2) 10Krinkle: Simplify mc.php (2/7): Remove mc-labs.php overrides that match prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692723 [21:23:19] (03PS2) 10Krinkle: Simplify mc.php (3/7): Remove $wgMemCachedTimeout setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692724 [21:23:21] (03PS2) 
10Krinkle: Simplify mc.php (4/7): Move wgMemCachedServers to mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692725 [21:23:23] (03PS2) 10Krinkle: Simplify mc.php (5/7): Move wgMemCachedServers to mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692726 [21:23:25] (03PS2) 10Krinkle: Simplify mc.php (6/7): Move mc.php and db.php inclusion a few lines up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692727 [21:23:27] (03PS2) 10Krinkle: Simplify mc.php (7/7): Define 'wancache-main-mcrouter' unconditionally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692728 [21:23:29] (03PS2) 10Krinkle: [Beta Cluster] mc-labs.php: Enable onHostRoutingPrefix for WAN cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692729 (https://phabricator.wikimedia.org/T264604) [21:25:30] (03CR) 10Legoktm: [C: 03+2] lists: Redirect all admin/ URLs to postorius [puppet] - 10https://gerrit.wikimedia.org/r/692707 (owner: 10Legoktm) [21:27:08] (03PS1) 10Dzahn: admin: add brennen to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/692730 (https://phabricator.wikimedia.org/T283022) [21:27:48] 10SRE, 10Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (10Legoktm) >>! In T283103#7096658, @Legoktm wrote: > Also it's a mistake that doesn't redirect... [21:28:10] (03CR) 10Dzahn: "Brandon, adding you as clinic duty. Addition to the group needs approval by wkandek. Wolfgang, added you for approval." 
[puppet] - 10https://gerrit.wikimedia.org/r/692730 (https://phabricator.wikimedia.org/T283022) (owner: 10Dzahn) [21:30:07] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:30:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 for Brennen Bearnes (brennen) - https://phabricator.wikimedia.org/T283022 (10Dzahn) [21:30:48] (03CR) 10Wolfgang Kandek: [C: 03+1] "Approved." [puppet] - 10https://gerrit.wikimedia.org/r/692730 (https://phabricator.wikimedia.org/T283022) (owner: 10Dzahn) [21:32:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 for Brennen Bearnes (brennen) - https://phabricator.wikimedia.org/T283022 (10wkandek) [21:32:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 for Brennen Bearnes (brennen) - https://phabricator.wikimedia.org/T283022 (10wkandek) Approved. [21:35:37] (03CR) 10Dzahn: [C: 03+2] admin: add brennen to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/692730 (https://phabricator.wikimedia.org/T283022) (owner: 10Dzahn) [21:41:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 for Brennen Bearnes (brennen) - https://phabricator.wikimedia.org/T283022 (10Dzahn) 05Open→03Resolved a:03Dzahn merged and deployed ` [gitlab1001:~] $ id brennen uid=20958(brennen) gid=500(wikidev) groups=500(wikidev),82... 
[21:42:52] (03PS1) 10Ottomata: kafka - Use hardened_tls instead of java::security if $ssl_enabled [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) [21:45:49] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29618/console" [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [21:49:24] 10SRE, 10discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10Dzahn) Pretty sure it's T95662 and service discovery as in https://platform9.com/blog/kubernetes-service-discovery-principles-in-practice/ but @Joe will be able to confirm. [21:49:44] 10SRE, 10Kubernetes, 10discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (10Dzahn) [21:55:39] (03PS3) 10Cwhite: logstash: add openstack ECS transition config and tests [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) [21:59:42] (03CR) 10Cwhite: logstash: add openstack ECS transition config and tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:02:20] (03CR) 10Bstorm: [C: 03+1] "So the resulting output looks nearly exactly like our existing source manifest yaml. 
It includes the additional RBAC needed for the ingres" [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [22:08:57] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez) [22:13:09] (03PS1) 10Legoktm: Add helmfile.d for shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) [22:13:49] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) [22:14:34] (03CR) 10jerkins-bot: [V: 04-1] Add helmfile.d for shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [22:16:47] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 76 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:23:31] (03CR) 10Legoktm: "Error: chart "shellbox" matching version "" not found in wmf-stable index. (try 'helm repo update'). no chart name found" [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [22:24:14] (03CR) 10Thcipriani: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [22:41:22] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jdforrester-WMF) [22:42:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 42 probes of 628 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:51:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:56:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:57:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Jclark-ctr) a:05Jclark-ctr→03RobH Finished moving host attached is ports and cable ID cloudcephosd1016. rack C8. U30 port 1... 
[22:57:05] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Iniquity) [22:57:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Jclark-ctr) [22:58:17] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jdforrester-WMF) [23:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210518T2300) [23:00:05] Zabe: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:25] o/ [23:17:41] (03CR) 10Bstorm: "Strange that I don't see Jenkins running checks. I'll have to mess with that in a bit." [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:20:43] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:22:10] (03CR) 10Bstorm: "It might be sensible to fork base::remote_syslog to something simpler and cloud-specific. 
Otherwise, we will have puppet breaking in cloud" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:22:33] (03CR) 10jerkins-bot: [V: 04-1] Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:23:14] (03CR) 10Bstorm: "Thanks, Daniel, I was about to try that 😊" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:23:29] (03CR) 10Reedy: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:23:46] (03CR) 10Dzahn: "is it possible this is related to "only 'trusted users' can trigger CI checks"?" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:24:23] (03CR) 10Dzahn: "what Reedy said :)" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:24:32] Okay, a lot of comments ;) [23:25:06] An allowlist for CI? That's new to me.. [23:26:38] (03CR) 10Dzahn: "on the other hand forking base classes and creating new roles just for labs could also be seen as the root cause of "it's not like prod an" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:27:29] /ac/ac [23:35:38] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Glrx) The Commons file * https://commons.wikimedia.org/w/index.php?title=File%3ASystemLanguage.svg offers a quick and easy `switch` diagnostic because it displays the IETF lan... 
[23:40:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:42:19] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Dzahn) note: Having the librsvg package installed does not mean you get the rsvg-convert command as well. You'll have to apt install `librsvg2-bin` for that. [23:44:49] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 for Brennen Bearnes (brennen) - https://phabricator.wikimedia.org/T283022 (10brennen) Thanks! [23:46:05] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Dzahn) LANG=zh-hans rsvg-convert -w 512 -h 224 -o result-zh-hans.png SystemLanguage.svg Fontconfig warning: ignoring zh-hans: not a valid region tag LANG=zh-hant rsvg-conver... [23:50:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:54:18] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [23:55:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets