[00:22:42] (PS2) Tim Starling: Use the RequestTimeout library to set time limits [mediawiki-config] - https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326)
[00:22:47] (CR) Tim Starling: Use the RequestTimeout library to set time limits (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326) (owner: Tim Starling)
[00:28:47] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:31:17] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:46:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:03] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:58:31] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:02:57] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 119843008 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:05:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:08:25] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:10:47] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:12:22] (PS7) KartikMistry: Update cxserver to 2021-03-15-131520-production [deployment-charts] - https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711)
[04:18:37] (CR) KartikMistry: [C: +2] Update cxserver to 2021-03-15-131520-production [deployment-charts] - https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: KartikMistry)
[04:20:04] (Merged) jenkins-bot: Update cxserver to 2021-03-15-131520-production [deployment-charts] - https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: KartikMistry)
[04:28:58] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[04:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:00] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:34:20] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:40:24] Not continuing cxserver deployment as: https://grafana.wikimedia.org/d/F7rttgqmz/cxserver?orgId=1&refresh=30s&from=now-15m&to=now&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=cxserver -- will wait for Alex to debug what's wrong.
[05:16:23] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[05:18:43] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[06:11:00] !log Sanitize db1124 db2094 db1154: taywiki trvwiki mnwwiktionary
[06:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:28] (PS1) Marostegui: instances.yaml: Remove db1084 [puppet] - https://gerrit.wikimedia.org/r/673838 (https://phabricator.wikimedia.org/T276302)
[06:36:35] (CR) Marostegui: [C: +2] instances.yaml: Remove db1084 [puppet] - https://gerrit.wikimedia.org/r/673838 (https://phabricator.wikimedia.org/T276302) (owner: Marostegui)
[06:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1084 from dbctl T276302', diff saved to https://phabricator.wikimedia.org/P14959 and previous config saved to /var/cache/conftool/dbconfig/20210322-063732-marostegui.json
[06:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:42] T276302: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302
[06:47:27] SRE, netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (ayounsi) Seems expected: > Lumen Maintenance 20795082 is currently ongoing on your service. You may experience a full service interruption or degradation from Wed, 2021/03/17 04:00:00 GMT to Wed, 2021/03/2...
[06:51:20] (PS1) Marostegui: db1161: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/673843 (https://phabricator.wikimedia.org/T258361)
[06:51:43] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (ayounsi)
[06:52:06] SRE, ops-eqiad, DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (ayounsi) Resolved→Open Got another similar alert, see: https://librenms.wikimedia.org/graphs/id=8980/type=sensor_power/from=1616136600/to=1616223000 It's barely touching the alerting...
[06:52:24] (CR) Marostegui: [C: +2] db1161: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/673843 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:52:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14960 and previous config saved to /var/cache/conftool/dbconfig/20210322-065236-root.json
[06:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:59] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:53:09] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1161 is now being pooled
[07:03:32] (PS2) Giuseppe Lavagetto: [WiP] test harness for php-fpm images [docker-images/production-images] - https://gerrit.wikimedia.org/r/672768
[07:03:34] (PS1) Giuseppe Lavagetto: mcrouter: add healthz script [docker-images/production-images] - https://gerrit.wikimedia.org/r/673845
[07:07:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14961 and previous config saved to /var/cache/conftool/dbconfig/20210322-070740-root.json
[07:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:35] (CR) ArielGlenn: dumpwikibasejson: Make segment separation more robust (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: Hoo man)
[07:08:56] (PS1) Marostegui: production-m5.sql: Add mailman3 databases grant [puppet] - https://gerrit.wikimedia.org/r/673846 (https://phabricator.wikimedia.org/T256538)
[07:12:26] SRE, DBA, Wikimedia-Mailing-lists, Patch-For-Review: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) Grants added: ` root@db1128.eqiad.wmnet[(none)]> show grants for 'testmailman3'@'208.80.154.13'; +----------------------------------------------------...
[07:13:05] (PS2) Marostegui: production-m5.sql: Add mailman3 databases grant [puppet] - https://gerrit.wikimedia.org/r/673846 (https://phabricator.wikimedia.org/T256538)
[07:13:49] (CR) Marostegui: [C: +2] production-m5.sql: Add mailman3 databases grant [puppet] - https://gerrit.wikimedia.org/r/673846 (https://phabricator.wikimedia.org/T256538) (owner: Marostegui)
[07:14:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141 for schema change', diff saved to https://phabricator.wikimedia.org/P14962 and previous config saved to /var/cache/conftool/dbconfig/20210322-071430-marostegui.json
[07:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:13] SRE, DBA, Wikimedia-Mailing-lists, Patch-For-Review: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) Yup, the ferm is missing (T277286) I'll add it.
[07:22:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14963 and previous config saved to /var/cache/conftool/dbconfig/20210322-072243-root.json
[07:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:55] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Marostegui)
[07:26:29] SRE, DBA, Wikimedia-Mailing-lists, Patch-For-Review: Create test databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) Open→Resolved Excellent, if there's a tracking task for that missing bit, I am going to close this as done. Please re-open if you find issues wi...
[07:28:53] (PS1) Elukey: Reduce buffer pool memory for dbstore1004's mariadb instances [puppet] - https://gerrit.wikimedia.org/r/673849 (https://phabricator.wikimedia.org/T270112)
[07:32:43] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable CodeMirror accessibility colors on initial wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) (owner: Andrew-WMDE)
[07:37:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Slowly repool db1161', diff saved to https://phabricator.wikimedia.org/P14964 and previous config saved to /var/cache/conftool/dbconfig/20210322-073747-root.json
[07:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:00] (CR) Marostegui: [C: +1] "This needs mysql restart" [puppet] - https://gerrit.wikimedia.org/r/673849 (https://phabricator.wikimedia.org/T270112) (owner: Elukey)
[07:43:40] (PS15) Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062)
[07:44:03] (PS4) Muehlenhoff: Adapt condition to mask puppet service [puppet] - https://gerrit.wikimedia.org/r/673025
[07:44:19] (CR) Elukey: "Thanks a lot for the review Joseph, applied all the suggestions and left one question/open-point for non-analytics-privatedata users :)" (9 comments) [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[07:44:35] (CR) Elukey: [C: +2] Reduce buffer pool memory for dbstore1004's mariadb instances [puppet] - https://gerrit.wikimedia.org/r/673849 (https://phabricator.wikimedia.org/T270112) (owner: Elukey)
[07:45:00] (PS10) Ayounsi: Add Capirca support to Homer [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865)
[07:45:16] (CR) Ayounsi: "recheck" [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[07:51:24] !log stop/start mariadb instances on dbstore1004 to reduce buffer pool memory settings - T273865
[07:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:32] T273865: Investigate Capirca - https://phabricator.wikimedia.org/T273865
[07:55:10] (PS1) Marostegui: install_server: Reimage db1158 as Stretch [puppet] - https://gerrit.wikimedia.org/r/673931 (https://phabricator.wikimedia.org/T258361)
[07:55:57] (CR) JMeybohm: [V: +2 C: +2] eventrouter: Update build and base image, switch to nobody [docker-images/production-images] - https://gerrit.wikimedia.org/r/669846 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm)
[07:56:11] (CR) JMeybohm: [V: +2 C: +2] ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm)
[07:56:21] (CR) JMeybohm: [V: +2 C: +2] fluent-bit: Switch to nobody and use seed_image [docker-images/production-images] - https://gerrit.wikimedia.org/r/670838 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm)
[07:57:07] (PS3) JMeybohm: fluent-bit: Switch to nobody and use seed_image [docker-images/production-images] - https://gerrit.wikimedia.org/r/670838 (https://phabricator.wikimedia.org/T274852)
[07:57:16] (CR) Muehlenhoff: [C: +2] Adapt condition to mask puppet service [puppet] - https://gerrit.wikimedia.org/r/673025 (owner: Muehlenhoff)
[07:57:22] (CR) Marostegui: [C: +2] install_server: Reimage db1158 as Stretch [puppet] - https://gerrit.wikimedia.org/r/673931 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[07:58:20] elukey: wrong task :)
[07:59:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:59:11] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1158.eqiad.wmnet'] ` The log ca...
[08:00:12] XioNoX: ahahahha sorryyy
[08:00:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 to clone db1165', diff saved to https://phabricator.wikimedia.org/P14965 and previous config saved to /var/cache/conftool/dbconfig/20210322-080020-marostegui.json
[08:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:41] !log Stop MySQL on db1085 to clone db1165 (lag will appear on s6 on wiki replicas) T258361
[08:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:47] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[08:01:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:02:10] !log build and release docker-registry.discovery.wmnet/eventrouter:0.3.0-6, docker-registry.discovery.wmnet/fluent-bit:1.5.3-3, docker-registry.discovery.wmnet/ratelimit:1.5.1-s3
[08:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:20] (PS1) JMeybohm: proton: Remove unused nodePort, enable telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/673932
[08:05:15] (CR) jerkins-bot: [V: -1] Add Capirca support to Homer [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[08:07:54] (PS1) Muehlenhoff: Additional PHP config tweak for tendril/buster [puppet] - https://gerrit.wikimedia.org/r/673933
[08:08:09] (CR) Joal: [C: +1] "LGTM :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:09:13] (PS7) Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918)
[08:11:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE
[08:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:10] (PS1) JMeybohm: api-gateway: Bump ratelimit and fluent-bit image versions [deployment-charts] - https://gerrit.wikimedia.org/r/673934 (https://phabricator.wikimedia.org/T274852)
[08:11:13] (PS1) JMeybohm: eventrouter: Use debian based image [deployment-charts] - https://gerrit.wikimedia.org/r/673935 (https://phabricator.wikimedia.org/T274852)
[08:12:09] (CR) Kosta Harlan: [C: +1] Update GrowthExperiments cronjob parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673631 (https://phabricator.wikimedia.org/T275172) (owner: Gergő Tisza)
[08:12:33] (CR) Elukey: [C: +2] hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:12:48] (PS17) Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062)
[08:13:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE
[08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:27] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435
[08:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:35] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[08:13:35] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[08:14:02] (CR) Elukey: [C: +2] hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:14:11] (CR) Ayounsi: "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[08:14:19] (CR) Gergő Tisza: Update GrowthExperiments cronjob parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673631 (https://phabricator.wikimedia.org/T275172) (owner: Gergő Tisza)
[08:18:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:18:52] (CR) JMeybohm: [C: +1] "I won't dare to argue about the actual FCGI implementation because FCGI is weird. But apart from that, this looks reasonable 😊" [docker-images/production-images] - https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) (owner: Giuseppe Lavagetto)
[08:19:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:20:23] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1158.eqiad.wmnet'] ` and were **ALL** successful.
[08:21:34] (03PS3) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) [08:21:41] (03PS2) 10Marostegui: wmnet: Update s7-master cname [dns] - 10https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) [08:25:43] (03PS1) 10Elukey: hadoop: fix Yarn capacity scheduler queue mappings [puppet] - 10https://gerrit.wikimedia.org/r/673936 (https://phabricator.wikimedia.org/T277062) [08:27:00] (03CR) 10Elukey: [C: 03+2] hadoop: fix Yarn capacity scheduler queue mappings [puppet] - 10https://gerrit.wikimedia.org/r/673936 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [08:27:29] (03CR) 10Ayounsi: "> Patch Set 6:" (038 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [08:29:14] (03PS1) 10Elukey: hadoop: fix typo in capacity scheduler's config [puppet] - 10https://gerrit.wikimedia.org/r/673937 [08:30:22] (03CR) 10Elukey: [C: 03+2] hadoop: fix typo in capacity scheduler's config [puppet] - 10https://gerrit.wikimedia.org/r/673937 (owner: 10Elukey) [08:30:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14967 and previous config saved to /var/cache/conftool/dbconfig/20210322-083023-root.json [08:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:00] (03PS4) 10JMeybohm: Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (owner: 10Alexandros Kosiaris) [08:41:11] (03PS5) 10JMeybohm: Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [08:42:40] (03PS7) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 
(https://phabricator.wikimedia.org/T273115) [08:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14968 and previous config saved to /var/cache/conftool/dbconfig/20210322-084527-root.json [08:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:35] (03CR) 10DCausse: create helmfile.d structure (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [08:47:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:48:46] (03PS2) 10Muehlenhoff: Depool poolcounter1005 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663584 [08:49:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:54] (03PS1) 10Elukey: camus: switch to the yarn 'ingest' queue for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/673943 (https://phabricator.wikimedia.org/T277062) [08:50:18] (03CR) 10Elukey: [C: 03+2] camus: switch to the yarn 'ingest' queue for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/673943 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [08:51:10] (03PS8) 10Effie Mouzeli: hieradata: enable memcached socket mwdebug1003, mwdebug2001 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [08:55:03] (03PS4) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) [08:55:50] (03CR) 10jerkins-bot: [V: 04-1] Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [09:00:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14969 and previous config saved to /var/cache/conftool/dbconfig/20210322-090030-root.json [09:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:38] (03PS1) 10Elukey: analytics: set test refine jobs in a different Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673948 (https://phabricator.wikimedia.org/T277062) [09:00:42] (03PS2) 10JMeybohm: kubernetes staging-eqiad: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/671175 (https://phabricator.wikimedia.org/T276305) [09:00:44] (03PS1) 10JMeybohm: kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - 10https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741) [09:01:02] (03CR) 
10Elukey: [C: 03+2] analytics: set test refine jobs in a different Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673948 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:01:40] (03CR) 10jerkins-bot: [V: 04-1] analytics: set test refine jobs in a different Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673948 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:01:50] uff [09:02:15] (03PS2) 10Elukey: analytics: set test refine jobs in a different Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673948 (https://phabricator.wikimedia.org/T277062) [09:04:36] (03CR) 10Elukey: [C: 03+2] analytics: set test refine jobs in a different Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673948 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:07:31] (03PS1) 10JMeybohm: kubernetes eqiad: Apply role and hiera values to new masters [puppet] - 10https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) [09:08:48] (03PS1) 10Volans: tests: fix pip backtracking [software/cumin] - 10https://gerrit.wikimedia.org/r/673953 [09:11:16] (03PS5) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) [09:11:37] (03PS1) 10Elukey: analytics: move druid load jobs to the analytics Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673954 (https://phabricator.wikimedia.org/T277062) [09:11:58] (03CR) 10Elukey: [C: 03+2] analytics: move druid load jobs to the analytics Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673954 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:12:35] (03CR) 10Elukey: [V: 03+2 C: 03+2] analytics: move druid load jobs to the analytics Yarn queue [puppet] - 10https://gerrit.wikimedia.org/r/673954 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:12:42] (03CR) 10Elukey: [C: 03+2] analytics: move druid load jobs to the analytics Yarn queue 
[puppet] - 10https://gerrit.wikimedia.org/r/673954 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:15:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Slowly repool db1141', diff saved to https://phabricator.wikimedia.org/P14970 and previous config saved to /var/cache/conftool/dbconfig/20210322-091534-root.json [09:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:16] (03CR) 10Marostegui: [C: 03+1] "let's see...." [puppet] - 10https://gerrit.wikimedia.org/r/673933 (owner: 10Muehlenhoff) [09:17:22] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:18:22] (03CR) 10Ayounsi: "Can be tested on https://netbox-next.wikimedia.org/extras/scripts/capirca.GetHosts/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [09:18:39] (03PS1) 10JMeybohm: admin_ng: Enable eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/673955 (https://phabricator.wikimedia.org/T277741) [09:18:41] (03PS1) 10JMeybohm: Remove helmfile.d/admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/673956 (https://phabricator.wikimedia.org/T277741) [09:19:01] (03PS2) 10Giuseppe Lavagetto: Scaffold: fix template calls for php applications [deployment-charts] - 10https://gerrit.wikimedia.org/r/672980 [09:19:03] (03PS5) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [09:19:34] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:19:56] PROBLEM - Prometheus jobs reduced availability on 
alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:20:28] (03CR) 10jerkins-bot: [V: 04-1] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [09:20:59] (03CR) 10Hoo man: dumpwikibasejson: Make segment separation more robust (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: 10Hoo man) [09:24:22] (03PS6) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [09:25:04] 10SRE, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10ayounsi) a:05ayounsi→03Cmjohnson I disabled the ones that were obviously unused. I'll let you disable the remaining ones as they have descriptions, etc... [09:25:40] (03CR) 10jerkins-bot: [V: 04-1] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [09:26:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:30:15] (03PS1) 10Volans: tests: fix pip backtracking [software/spicerack] - 10https://gerrit.wikimedia.org/r/673961 [09:30:22] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1003 is CRITICAL: CRITICAL: nf_conntrack usage over 80% in netns qrouter-d93771ba-2711-4f88-804a-8df6fd03978a https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:32:32] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet1003 is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:32:50] (03CR) 10Volans: [C: 03+1] "Thanks for the patch Kunal" [cookbooks] - 10https://gerrit.wikimedia.org/r/673558 (owner: 10Legoktm) [09:33:01] (03CR) 10Jgiannelos: "Adding Effie in the loop in case this patch is useful for the deployment work." [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [09:34:04] (03Abandoned) 10Volans: tests: fix pip backtracking [software/cumin] - 10https://gerrit.wikimedia.org/r/673953 (owner: 10Volans) [09:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142 for schema change', diff saved to https://phabricator.wikimedia.org/P14971 and previous config saved to /var/cache/conftool/dbconfig/20210322-093558-marostegui.json [09:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:38:21] (03CR) 10jerkins-bot: [V: 04-1] tests: fix pip backtracking [software/spicerack] - 10https://gerrit.wikimedia.org/r/673961 (owner: 10Volans) [09:39:32] 
jouncebot: now [09:39:32] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [09:39:34] jouncebot: next [09:39:34] In 0 hour(s) and 50 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T1030) [09:39:50] (PS3) Reedy: Remove wgEnableRestAPI [mediawiki-config] - https://gerrit.wikimedia.org/r/671203 [09:39:56] (CR) Reedy: [C: +2] Remove wgEnableRestAPI [mediawiki-config] - https://gerrit.wikimedia.org/r/671203 (owner: Reedy) [09:40:48] PROBLEM - puppet last run on sretest1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604918 seconds, message: test puppet deactivations alerts re-enable after 22/03/21 - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:41:10] (Merged) jenkins-bot: Remove wgEnableRestAPI [mediawiki-config] - https://gerrit.wikimedia.org/r/671203 (owner: Reedy) [09:42:24] (PS3) Reedy: Remove deprecated setting [mediawiki-config] - https://gerrit.wikimedia.org/r/618482 (https://phabricator.wikimedia.org/T232542) (owner: Awight) [09:42:41] (PS4) Reedy: Remove deprecated setting [mediawiki-config] - https://gerrit.wikimedia.org/r/618482 (https://phabricator.wikimedia.org/T232542) (owner: Awight) [09:42:45] (CR) Reedy: [C: +2] Remove deprecated setting [mediawiki-config] - https://gerrit.wikimedia.org/r/618482 (https://phabricator.wikimedia.org/T232542) (owner: Awight) [09:43:06] (CR) Muehlenhoff: [C: +2] Additional PHP config tweak for tendril/buster [puppet] - https://gerrit.wikimedia.org/r/673933 (owner: Muehlenhoff) [09:43:54] (Merged) jenkins-bot: Remove deprecated setting [mediawiki-config] - https://gerrit.wikimedia.org/r/618482 (https://phabricator.wikimedia.org/T232542) (owner: Awight) [09:44:57] (PS8) David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - https://gerrit.wikimedia.org/r/667183
(https://phabricator.wikimedia.org/T274497) [09:44:59] (PS6) David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) [09:45:01] (PS4) David Caro: wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) [09:45:20] (PS3) Reedy: Drop ability to use graphoid [mediawiki-config] - https://gerrit.wikimedia.org/r/654954 (https://phabricator.wikimedia.org/T242855) (owner: Jforrester) [09:45:29] (CR) Reedy: [C: +2] Drop ability to use graphoid [mediawiki-config] - https://gerrit.wikimedia.org/r/654954 (https://phabricator.wikimedia.org/T242855) (owner: Jforrester) [09:46:21] (Merged) jenkins-bot: Drop ability to use graphoid [mediawiki-config] - https://gerrit.wikimedia.org/r/654954 (https://phabricator.wikimedia.org/T242855) (owner: Jforrester) [09:48:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:48:12] (PS3) Reedy: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in production [mediawiki-config] - https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) (owner: Jforrester) [09:48:26] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config cleanup (duration: 01m 20s) [09:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:15] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (JMeybohm) [09:49:19] SRE, MW-on-K8s, serviceops, Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (JMeybohm) [09:49:50] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config cleanup (duration: 00m 59s) [09:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:55] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config cleanup (duration: 00m 57s) [09:50:59] (PS2) Volans: tests: fix pip backtracking [software/spicerack] - https://gerrit.wikimedia.org/r/673961 [09:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:03] (PS1) Muehlenhoff: Assign tendril role to dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/673968 [09:51:55] (PS2) Volans: tests: fix pip backtracking [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [09:52:04] (CR) Volans: "reply inline" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [09:59:00] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki) >>!
In T277711#6927861, @JMeybohm wrote: > I don't really like option 3 just because it moves parts of the software stack to the node itself and I would personally lik... [09:59:12] SRE, Wikimedia-Logstash, observability, Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (hashar) Gerrit and Phabricator now have Apache 2 access log ingested. I have created a saved search in Kibana for `source.type: apache2`, added a couple very basic... [09:59:38] (CR) Marostegui: "We probably need to add it to modules/profile/manifests/mariadb/misc/tendril.pp for ferm?" [puppet] - https://gerrit.wikimedia.org/r/673968 (owner: Muehlenhoff) [10:00:00] (CR) jerkins-bot: [V: -1] tests: fix pip backtracking [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [10:01:14] SRE, SRE-Access-Requests: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (Volans) p:Triage→Medium I would guess that the addition of `ladsgroup` to the `mailman3-roots` group was implicitly approved too during the last SRE meeting that approved...
[10:02:06] (PS1) Volans: admin: add ladsgroup to mailman3-roots [puppet] - https://gerrit.wikimedia.org/r/673971 (https://phabricator.wikimedia.org/T278078) [10:03:25] (CR) Hnowlan: [C: +2] "I can handle rollout of these" [deployment-charts] - https://gerrit.wikimedia.org/r/673934 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm) [10:03:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:03:44] (CR) Volans: "recheck" [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [10:04:15] (CR) Volans: [C: -1] "Pending approval on task" [puppet] - https://gerrit.wikimedia.org/r/673971 (https://phabricator.wikimedia.org/T278078) (owner: Volans) [10:04:50] (Merged) jenkins-bot: api-gateway: Bump ratelimit and fluent-bit image versions [deployment-charts] - https://gerrit.wikimedia.org/r/673934 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm) [10:05:01] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe) >>! In T277711#6933756, @jijiki wrote: >>>! In T277711#6927861, @JMeybohm wrote: >> I don't really like option 3 just because it moves parts of the software stack to the... [10:05:24] thanks hnowlan! <3 [10:08:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:08:43] (CR) Elukey: [C: +2] Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: Elukey) [10:12:11] !log run homer for cr1/cr2 eqiad and codfw to add new iBGP session for the k8s ML clusters - https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 [10:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:12] starting from codfw [10:13:23] (PS1) Filippo Giunchedi: prometheus: remove ircd-exporter [puppet] - https://gerrit.wikimedia.org/r/673972 (https://phabricator.wikimedia.org/T224579) [10:14:38] (PS1) Marostegui: mariadb: Productionize db1161. [puppet] - https://gerrit.wikimedia.org/r/673973 (https://phabricator.wikimedia.org/T258361) [10:14:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:15:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:15:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:19] (CR) Marostegui: [C: +2] mariadb: Productionize db1161. [puppet] - https://gerrit.wikimedia.org/r/673973 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [10:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:34] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (jijiki) >>! In T277711#6933792, @Joe wrote: >>>! In T277711#6933756, @jijiki wrote: >>>>!
In T277711#6927861, @JMeybohm wrote: >>> I don't really like option 3 just because it... [10:16:01] (CR) Marostegui: "this was db1165, not db1161 as the commit message says" [puppet] - https://gerrit.wikimedia.org/r/673973 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [10:17:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:17:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:23] (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/673968 (owner: Muehlenhoff) [10:19:38] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (JMeybohm) >>! In T277711#6933792, @Joe wrote: > That is already done in the MediaWiki chart. But that does now deploy mcrouter as a sidecar in each MW pod. AIUI this might co... [10:21:24] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:34] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:39] (PS14) Jbond: P:netbase: parse the service catalogue and inject the service ports [puppet] - https://gerrit.wikimedia.org/r/673105 [10:23:53] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1165 is now replicating, I am checking all the tables now. [10:25:57] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[10:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:51] (CR) Jgiannelos: [C: +2] Configure prometheus metrics for chromium-renderer [deployment-charts] - https://gerrit.wikimedia.org/r/673454 (https://phabricator.wikimedia.org/T277857) (owner: Jgiannelos) [10:30:04] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T1030). [10:30:24] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe) >>! In T277711#6933874, @JMeybohm wrote: >>>! In T277711#6933792, @Joe wrote: >> That is already done in the MediaWiki chart. > > But that does now deploy mcrouter as a... [10:31:17] (Merged) jenkins-bot: Configure prometheus metrics for chromium-renderer [deployment-charts] - https://gerrit.wikimedia.org/r/673454 (https://phabricator.wikimedia.org/T277857) (owner: Jgiannelos) [10:32:10] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/673979 (https://phabricator.wikimedia.org/T128546) [10:32:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:33:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [10:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:30] (PS1) Alexandros Kosiaris: cxserver: Use TLS port for apertium network policy [deployment-charts] - https://gerrit.wikimedia.org/r/673980 [10:34:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:34:53] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/673979 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:37] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/673979 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:40:15] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:673979| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:22] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:40:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:41:13] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:673979| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:53] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:41:53] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:41:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14973 and previous config saved to /var/cache/conftool/dbconfig/20210322-104156-root.json [10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:43] (CR) ArielGlenn: [C: +1] "This looks good but I have not tested it." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: Hoo man) [10:43:30] (CR) Muehlenhoff: [C: +1] "Looks good, but we also need to drop the collection in profile::prometheus::ops for 9197, right?"
[puppet] - https://gerrit.wikimedia.org/r/673972 (https://phabricator.wikimedia.org/T224579) (owner: Filippo Giunchedi) [10:43:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:44:43] (PS3) Volans: tests: fix pip backtracking [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [10:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14974 and previous config saved to /var/cache/conftool/dbconfig/20210322-104443-root.json [10:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:50] (CR) Volans: "reply inline" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [10:45:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:46:11] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (aborrero) a:aborrero→RobH I explained in T270705#6750907 (procurement) why this needs 10G (at least in the dataplane). T... [10:46:59] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [10:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [10:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:08] (CR) Alexandros Kosiaris: [C: +2] cxserver: Use TLS port for apertium network policy [deployment-charts] - https://gerrit.wikimedia.org/r/673980 (owner: Alexandros Kosiaris) [10:48:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [10:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:48:26] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:38] (Merged) jenkins-bot: cxserver: Use TLS port for apertium network policy [deployment-charts] - https://gerrit.wikimedia.org/r/673980 (owner: Alexandros Kosiaris) [10:51:41] !log installing libdbi-perl security updates [10:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:59] (PS1) Elukey: Enable coredns for k8s ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/673985 (https://phabricator.wikimedia.org/T272918) [10:53:07] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:53:07] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[10:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:08] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28695/console" [puppet] - https://gerrit.wikimedia.org/r/673985 (https://phabricator.wikimedia.org/T272918) (owner: Elukey) [10:55:12] (PS2) Filippo Giunchedi: prometheus: remove ircd-exporter [puppet] - https://gerrit.wikimedia.org/r/673972 (https://phabricator.wikimedia.org/T224579) [10:55:52] (CR) Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/673972 (https://phabricator.wikimedia.org/T224579) (owner: Filippo Giunchedi) [10:56:46] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/673972 (https://phabricator.wikimedia.org/T224579) (owner: Filippo Giunchedi) [10:57:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14975 and previous config saved to /var/cache/conftool/dbconfig/20210322-105700-root.json [10:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:00] SRE: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff) [10:58:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:58:41] (CR) Marostegui: [C: +1] Assign tendril role to dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/673968 (owner: Muehlenhoff) [10:59:21] (CR) Volans: [C: +2] tests: fix pip backtracking [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [10:59:47] !log
marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14976 and previous config saved to /var/cache/conftool/dbconfig/20210322-105947-root.json [10:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:00:49] SRE: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff) [11:01:43] (CR) Filippo Giunchedi: [C: +2] prometheus: remove ircd-exporter [puppet] - https://gerrit.wikimedia.org/r/673972 (https://phabricator.wikimedia.org/T224579) (owner: Filippo Giunchedi) [11:05:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:05:43] (Merged) jenkins-bot: tests: fix pip backtracking [software/cumin] - https://gerrit.wikimedia.org/r/673564 (owner: Legoktm) [11:09:17] (CR) David Caro: [C: +1] "Just got a question, adding +1 to avoid blocking in case the answer is "yep"." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm) [11:09:50] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [11:09:50] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[11:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:22] (CR) Elukey: [V: +1 C: +2] Enable coredns for k8s ml-serve clusters [puppet] - https://gerrit.wikimedia.org/r/673985 (https://phabricator.wikimedia.org/T272918) (owner: Elukey) [11:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14977 and previous config saved to /var/cache/conftool/dbconfig/20210322-111203-root.json [11:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:02] (PS1) Volans: tests: fix pip backtracking [software/homer] - https://gerrit.wikimedia.org/r/673990 [11:14:08] (PS2) Muehlenhoff: Assign tendril role to dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/673968 [11:14:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14978 and previous config saved to /var/cache/conftool/dbconfig/20210322-111451-root.json [11:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [11:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:07] (CR) David Caro: [C: +2] wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [11:21:09] (CR) David Caro: [C: +2] wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [11:21:13] (CR) David Caro: [C: +2] wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [11:25:11] (Merged) jenkins-bot: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [11:25:13] (Merged) jenkins-bot: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [11:25:31] (Merged) jenkins-bot: wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: David Caro) [11:27:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: Slowly repool db1142', diff saved to https://phabricator.wikimedia.org/P14979 and previous config saved to /var/cache/conftool/dbconfig/20210322-112707-root.json [11:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: Slowly repool db1085', diff saved to https://phabricator.wikimedia.org/P14980 and
previous config saved to /var/cache/conftool/dbconfig/20210322-112954-root.json [11:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] Puppet, SRE-tools, Python3-Porting, User-MoritzMuehlenhoff, and 2 others: Convert .py.erb files to files with configurations - https://phabricator.wikimedia.org/T277892 (jbond) [11:30:46] (CR) Muehlenhoff: [C: +2] Assign tendril role to dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/673968 (owner: Muehlenhoff) [11:30:58] Puppet, SRE, Continuous-Integration-Config, Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (jbond) [11:32:40] Puppet, SRE, Continuous-Integration-Config, Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (jbond) [11:35:38] (PS1) Muehlenhoff: Fix component name [puppet] - https://gerrit.wikimedia.org/r/673991 [11:35:43] (PS7) Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [11:35:45] (PS1) Giuseppe Lavagetto: Rakefile: allow running tests on individual charts [deployment-charts] - https://gerrit.wikimedia.org/r/673992 [11:36:30] (PS2) Muehlenhoff: Fix component name [puppet] - https://gerrit.wikimedia.org/r/673991 [11:38:08] (PS4) Volans: tests: add tests for the configuration files [homer/public] - https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) [11:38:13] (CR) Muehlenhoff: [C: +2] Fix component name [puppet] - https://gerrit.wikimedia.org/r/673991 (owner: Muehlenhoff) [11:39:35] (PS5) David Caro: wmcs.backups: Retry a VM backup 3 times before failing [puppet] -
https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) [11:40:22] (CR) Jbond: [C: +1] "LGTM, inline optional suggestion" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/672773 (owner: Effie Mouzeli) [11:41:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:42:05] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov) [11:43:03] (PS2) Giuseppe Lavagetto: Rakefile: fix most rubocop violations [deployment-charts] - https://gerrit.wikimedia.org/r/673992 [11:43:05] (PS8) Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [11:43:07] (PS1) Giuseppe Lavagetto: Rakefile: allow running on a subset of charts [deployment-charts] - https://gerrit.wikimedia.org/r/673994 [11:43:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:54:59] (CR) Jbond: "confirmed with sretest1002 that it went critical today (after 7 days disabled)" [puppet] - https://gerrit.wikimedia.org/r/672677 (owner: Jbond) [12:05:36] (PS1) Vgutierrez: Drop globalsign from the CAA records [dns] - https://gerrit.wikimedia.org/r/673997 (https://phabricator.wikimedia.org/T266503) [12:17:18] (CR) David Caro: paws: block using the Jupyterhub from Tor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: Bstorm) [12:18:06] (PS1) Muehlenhoff: Fix PHP version in one more place [puppet] - https://gerrit.wikimedia.org/r/674000 [12:19:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for schema change', diff saved to https://phabricator.wikimedia.org/P14981 and previous config saved to /var/cache/conftool/dbconfig/20210322-121924-marostegui.json [12:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:45] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:45] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:56] (CR) JMeybohm: [C: +2] eventrouter: Use debian based image [deployment-charts] - https://gerrit.wikimedia.org/r/673935 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm)
[12:23:14] (Merged) jenkins-bot: eventrouter: Use debian based image [deployment-charts] - https://gerrit.wikimedia.org/r/673935 (https://phabricator.wikimedia.org/T274852) (owner: JMeybohm)
[12:23:16] (CR) Muehlenhoff: [C: +2] Fix PHP version in one more place [puppet] - https://gerrit.wikimedia.org/r/674000 (owner: Muehlenhoff)
[12:27:13] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:01] (PS1) Jbond: puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674003
[12:28:03] (CR) jerkins-bot: [V: -1] puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674003 (owner: Jbond)
[12:28:09] (CR) JMeybohm: [C: +1] Rakefile: fix most rubocop violations [deployment-charts] - https://gerrit.wikimedia.org/r/673992 (owner: Giuseppe Lavagetto)
[12:28:11] (PS1) Jbond: puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674004
[12:28:13] (CR) JMeybohm: [C: +1] Rakefile: allow running on a subset of charts [deployment-charts] - https://gerrit.wikimedia.org/r/673994 (owner: Giuseppe Lavagetto)
[12:28:13] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:28:15] (CR) jerkins-bot: [V: -1] puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674004 (owner: Jbond)
[12:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:21] (PS2) Jbond: puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674003
[12:31:11] (PS3) Jbond: puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674003
[12:32:00] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28699/console" [puppet] - https://gerrit.wikimedia.org/r/674003 (owner: Jbond)
[12:39:32] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/674003 (owner: Jbond)
[12:42:37] (PS1) JMeybohm: admin_ng: Don't set kubernetesApi.port in eventrouter values [deployment-charts] - https://gerrit.wikimedia.org/r/674007
[12:44:45] (PS1) JMeybohm: admin_ng: Switch to cluster internal DNS name for API [deployment-charts] - https://gerrit.wikimedia.org/r/674008
[12:44:54] (CR) JMeybohm: [C: +2] admin_ng: Don't set kubernetesApi.port in eventrouter values [deployment-charts] - https://gerrit.wikimedia.org/r/674007 (owner: JMeybohm)
[12:46:16] (Merged) jenkins-bot: admin_ng: Don't set kubernetesApi.port in eventrouter values [deployment-charts] - https://gerrit.wikimedia.org/r/674007 (owner: JMeybohm)
[12:47:44] (CR) Ottomata: Declare WMDE Technical Wishes streams and migrate to EventGate on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/666423 (https://phabricator.wikimedia.org/T275005) (owner: Ottomata)
[12:50:47] (PS1) Andrew-WMDE: Allow access to the Maps service from MediaWiki-Vagrant [puppet] - https://gerrit.wikimedia.org/r/674009
[12:57:29] (CR) Awight: [C: +1] "This would be useful for our team's development." [puppet] - https://gerrit.wikimedia.org/r/674009 (owner: Andrew-WMDE)
[13:08:59] (CR) Kosta Harlan: [C: +2] linkrecommendation: Add Swagger UI environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/673471 (https://phabricator.wikimedia.org/T277644) (owner: Kosta Harlan)
[13:11:20] (Merged) jenkins-bot: linkrecommendation: Add Swagger UI environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/673471 (https://phabricator.wikimedia.org/T277644) (owner: Kosta Harlan)
[13:12:04] (PS1) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/674016
[13:13:50] (CR) Kosta Harlan: [C: +2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/674016 (owner: Kosta Harlan)
[13:15:14] (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/674016 (owner: Kosta Harlan)
[13:16:30] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[13:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:45] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[13:20:45] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[13:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14982 and previous config saved to /var/cache/conftool/dbconfig/20210322-132249-root.json
[13:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:09] (CR) Volans: [C: +1] "LGTM, thanks for looking into that" [puppet] - https://gerrit.wikimedia.org/r/674003 (owner: Jbond)
[13:25:44] (CR) Jbond: [V: +1 C: +2] puppetdb::app Add GC parameters to ro-host [puppet] - https://gerrit.wikimedia.org/r/674003 (owner: Jbond)
[13:26:17] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[13:26:17] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[13:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14983 and previous config saved to /var/cache/conftool/dbconfig/20210322-133753-root.json
[13:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14984 and previous config saved to /var/cache/conftool/dbconfig/20210322-135256-root.json
[13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:10] (PS1) Muehlenhoff: Extend tendril ferm rules with dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/674020
[13:56:30] (CR) Giuseppe Lavagetto: [C: +2] Scaffold: fix template calls for php applications [deployment-charts] - https://gerrit.wikimedia.org/r/672980 (owner: Giuseppe Lavagetto)
[13:56:41] (PS1) Muehlenhoff: Point dbtree.w.o to dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/674021 (https://phabricator.wikimedia.org/T224589)
[13:57:48] (Merged) jenkins-bot: Scaffold: fix template calls for php applications [deployment-charts] - https://gerrit.wikimedia.org/r/672980 (owner: Giuseppe Lavagetto)
[14:02:30] (CR) Marostegui: [C: +1] "🙏" [puppet] - https://gerrit.wikimedia.org/r/674021 (https://phabricator.wikimedia.org/T224589) (owner: Muehlenhoff)
[14:03:14] (CR) Muehlenhoff: [C: +2] Point dbtree.w.o to dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/674021 (https://phabricator.wikimedia.org/T224589) (owner: Muehlenhoff)
[14:07:21] !log rename cloud-hosts1-b-eqiad to cloud-hosts1-eqiad
[14:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:42] !log rename cloud-hosts1-b-eqiad to cloud-hosts1-eqiad - T277771
[14:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:49] T277771: Rename cloud-hosts1-b-eqiad to cloud-hosts1-eqiad - https://phabricator.wikimedia.org/T277771
[14:07:51] arturo: ^
[14:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Slowly repool db1143', diff saved to https://phabricator.wikimedia.org/P14985 and previous config saved to /var/cache/conftool/dbconfig/20210322-140800-root.json
[14:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:16] volans: Hi, I'm wondering if we can ask for an unusual exception: https://gerrit.wikimedia.org/r/c/maps/kartotherian/deploy/+/674023 or if you can suggest a good reviewer to CC?
[14:09:34] (CR) David Caro: [C: +1] "👌" [cookbooks] - https://gerrit.wikimedia.org/r/673597 (owner: Legoktm)
[14:11:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1144:3314 for schema change', diff saved to https://phabricator.wikimedia.org/P14986 and previous config saved to /var/cache/conftool/dbconfig/20210322-141146-marostegui.json
[14:11:49] awight: looking who that could be
[14:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:12] volans: Thanks. Trail goes cold a couple of years ago ;-)
[14:12:31] (PS2) Volans: sre.ganeti.makevm: Update example after 22c586eb2ac23 [cookbooks] - https://gerrit.wikimedia.org/r/673597 (owner: Legoktm)
[14:14:01] (PS1) Filippo Giunchedi: alerts: deploy to Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977)
[14:14:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:04] (CR) Volans: [C: +2] sre.ganeti.makevm: Update example after 22c586eb2ac23 [cookbooks] - https://gerrit.wikimedia.org/r/673597 (owner: Legoktm)
[14:15:06] (CR) jerkins-bot: [V: -1] alerts: deploy to Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[14:16:05] (PS1) Marostegui: tendril.sql: Add dbmonitor1002 grants [puppet] - https://gerrit.wikimedia.org/r/674026 (https://phabricator.wikimedia.org/T224589)
[14:17:37] (Merged) jenkins-bot: sre.ganeti.makevm: Update example after 22c586eb2ac23 [cookbooks] - https://gerrit.wikimedia.org/r/673597 (owner: Legoktm)
[14:19:18] (CR) Muehlenhoff: [C: +1] "Looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/674026 (https://phabricator.wikimedia.org/T224589) (owner: Marostegui)
[14:20:02] (PS1) Ayounsi: Revert "interface automation: 2nd fix for cloud-hosts VLAN" [software/netbox-extras] - https://gerrit.wikimedia.org/r/674046
[14:20:21] (PS2) Marostegui: tendril.sql: Add dbmonitor1002 grants [puppet] - https://gerrit.wikimedia.org/r/674026 (https://phabricator.wikimedia.org/T224589)
[14:20:49] (CR) Marostegui: tendril.sql: Add dbmonitor1002 grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/674026 (https://phabricator.wikimedia.org/T224589) (owner: Marostegui)
[14:20:58] SRE, ops-codfw: ganeti2015 doesn't boot - https://phabricator.wikimedia.org/T277537 (Papaul) p:Triage→Medium
[14:21:13] (CR) Marostegui: [C: +2] tendril.sql: Add dbmonitor1002 grants [puppet] - https://gerrit.wikimedia.org/r/674026 (https://phabricator.wikimedia.org/T224589) (owner: Marostegui)
[14:21:33] (PS2) Filippo Giunchedi: alerts: deploy to Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977)
[14:22:20] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[14:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:20] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:50] SRE, ops-codfw, decommission-hardware: decommission frqueue1001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T277171 (Papaul) @Jgreen @Cmjohnson any reason why this is assigned to me? Thanks.
[14:25:36] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:44] SRE, ops-codfw, decommission-hardware: decommission frqueue1001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T277171 (Papaul) Is it frqueue2001 or 1001?
[14:26:23] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:53] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/674047 (https://phabricator.wikimedia.org/T278131) (owner: DannyS712)
[14:28:59] SRE, ops-codfw, decommission-hardware: decommission frqueue1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T277171 (Jgreen)
[14:29:17] SRE, ops-codfw, decommission-hardware: decommission frqueue1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T277171 (Jgreen) >>! In T277171#6934778, @Papaul wrote: > Is it frqueue2001 or 1001? Thanks for catching this and sorry for the confusion, It's 1001, the eqiad host!
[14:29:33] awight: my sources tell me that your best bets are msantos and jgiannelos
[14:29:47] and no, a good journalist never reveals his sources :-P
[14:30:01] SRE, ops-eqiad, decommission-hardware: decommission frqueue1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T277171 (Jgreen) a:Papaul→Cmjohnson
[14:30:46] (CR) Filippo Giunchedi: [C: +1] "LGTM overall, would be nice to have (even minimal) tests for the Python code" [puppet] - https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: Cwhite)
[14:31:58] (CR) Filippo Giunchedi: [C: +1] "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[14:33:05] volans: Nicely done! Unrelatedly, complementary tickets to a Broadway show are in the glove box :-)
[14:33:36] lol
[14:35:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:35:58] volans: Strange, jgiannelos seems to be impossible to add as a reviewer O-o
[14:36:11] (PS2) JMeybohm: admin_ng: Switch to cluster internal DNS name for API [deployment-charts] - https://gerrit.wikimedia.org/r/674008
[14:36:13] (PS1) JMeybohm: eventrouter: Fix type conversion issues [deployment-charts] - https://gerrit.wikimedia.org/r/674028
[14:36:40] (PS2) JMeybohm: eventrouter: Fix type conversion issues [deployment-charts] - https://gerrit.wikimedia.org/r/674028
[14:37:02] I could add as a "CC"
[14:37:24] awight: it's already in CC
[14:37:33] * awight facepalms. Someone had already added as a reviewer, which was not reflected in my interface yet.
[14:38:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:39:09] (CR) CDanis: "first pass on this but looks good so far, thanks!" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[14:39:53] (PS2) Gergő Tisza: Update GrowthExperiments cronjob parameters [puppet] - https://gerrit.wikimedia.org/r/673631 (https://phabricator.wikimedia.org/T275171)
[14:43:23] (PS3) JMeybohm: eventrouter: Fix type conversion issues [deployment-charts] - https://gerrit.wikimedia.org/r/674028
[14:47:28] (CR) Mforns: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) (owner: Ottomata)
[14:51:21] (PS1) ArielGlenn: add --test option to bash worker wrapper [dumps] - https://gerrit.wikimedia.org/r/674068
[14:51:40] (CR) David Caro: [C: +2] wmcs.backups: Retry a VM backup 3 times before failing [puppet] - https://gerrit.wikimedia.org/r/668097 (https://phabricator.wikimedia.org/T276096) (owner: David Caro)
[14:52:44] (PS5) Volans: tests: add tests for the configuration files [homer/public] - https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688)
[14:52:46] (PS3) Volans: WIP. tests: generate documentation from schemas [homer/public] - https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688)
[14:52:48] (PS1) Volans: tests: update deprecated pytest option [homer/public] - https://gerrit.wikimedia.org/r/674069
[14:55:19] (CR) Marostegui: [C: +1] Extend tendril ferm rules with dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/674020 (owner: Muehlenhoff)
[14:59:00] (PS1) Hashar: gerrit: remove Apache MaxClients limit [puppet] - https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127)
[14:59:26] (PS1) Elukey: analytics: fix admin/submit policies for the yarn capacity scheduler [puppet] - https://gerrit.wikimedia.org/r/674071 (https://phabricator.wikimedia.org/T277062)
[15:01:31] (PS1) David Caro: wmcs.ceph.codfw: Upgrade to latest 5.X kernel [puppet] - https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565)
[15:03:05] (CR) Elukey: [C: +2] analytics: fix admin/submit policies for the yarn capacity scheduler [puppet] - https://gerrit.wikimedia.org/r/674071 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[15:03:32] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/674046 (owner: Ayounsi)
[15:03:51] RECOVERY - Host ganeti2015 is UP: PING WARNING - Packet loss = 50%, RTA = 33.09 ms
[15:04:17] PROBLEM - SSH on ganeti2015 is CRITICAL: connect to address 10.192.48.48 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:05:51] SRE, Mail: Domains of most projects do not have DMARC policy - https://phabricator.wikimedia.org/T211403 (Reedy)
[15:07:25] (PS1) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[15:08:30] (CR) jerkins-bot: [V: -1] etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[15:10:21] PROBLEM - Host ganeti2015 is DOWN: PING CRITICAL - Packet loss = 100%
[15:11:29] SRE, serviceops, User-jijiki: Put rdb200[78] into service - https://phabricator.wikimedia.org/T255681 (Legoktm)
[15:12:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 25%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14987 and previous config saved to /var/cache/conftool/dbconfig/20210322-151257-root.json
[15:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:19] RECOVERY - Host ganeti2015 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms
[15:16:25] PROBLEM - Check systemd state on ganeti2015 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:45] (CR) Jbond: etcd: Use cfssl for peer-to-peer communication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[15:18:11] RECOVERY - SSH on ganeti2015 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:18:43] RECOVERY - Check systemd state on ganeti2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:32] SRE, Performance-Team, observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (lmata) hi @Legoktm let us (o11y) know if you need some help!
[15:19:56] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs200[123] - https://phabricator.wikimedia.org/T276647 (Papaul)
[15:20:20] (PS4) Volans: tests: generate documentation from schemas [homer/public] - https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688)
[15:20:43] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 38.32 ms
[15:21:18] (CR) Ayounsi: [C: +2] Revert "interface automation: 2nd fix for cloud-hosts VLAN" [software/netbox-extras] - https://gerrit.wikimedia.org/r/674046 (owner: Ayounsi)
[15:24:01] PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti2015 job=node site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[15:24:39] (PS2) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[15:24:49] (PS1) Ayounsi: cloud-hosts1-b renamed [software/netbox-extras] - https://gerrit.wikimedia.org/r/674080 (https://phabricator.wikimedia.org/T277771)
[15:25:09] SRE, ops-codfw, DC-Ops, serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (Papaul)
[15:28:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 50%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14988 and previous config saved to /var/cache/conftool/dbconfig/20210322-152800-root.json
[15:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:25] (CR) Jbond: "Thanks updated" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[15:30:35] (CR) Muehlenhoff: [C: +2] Extend tendril ferm rules with dbmonitor1002 [puppet] - https://gerrit.wikimedia.org/r/674020 (owner: Muehlenhoff)
[15:31:24] (CR) Alexandros Kosiaris: [C: +1] proton: Remove unused nodePort, enable telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/673932 (owner: JMeybohm)
[15:31:58] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/674080 (https://phabricator.wikimedia.org/T277771) (owner: Ayounsi)
[15:33:42] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[15:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:19] (CR) Ayounsi: [C: +2] cloud-hosts1-b renamed [software/netbox-extras] - https://gerrit.wikimedia.org/r/674080 (https://phabricator.wikimedia.org/T277771) (owner: Ayounsi)
[15:38:33] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:20] (PS3) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[15:41:03] (CR) Ahmon Dancy: Helm chart to run MediaWiki (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: Giuseppe Lavagetto)
[15:42:20] (PS4) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[15:43:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 75%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14989 and previous config saved to /var/cache/conftool/dbconfig/20210322-154304-root.json
[15:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:03] (CR) Volans: "Did a first pass, some comments inline." (9 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[15:50:27] PROBLEM - Host ganeti2015 is DOWN: PING CRITICAL - Packet loss = 100%
[15:51:55] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:54:09] (CR) Cwhite: [C: +1] logstash: add field checks to filter throttle [puppet] - https://gerrit.wikimedia.org/r/633224 (owner: Herron)
[15:54:17] (PS2) Herron: logstash: add field checks to filter throttle [puppet] - https://gerrit.wikimedia.org/r/633224
[15:56:41] (CR) jerkins-bot: [V: -1] logstash: add field checks to filter throttle [puppet] - https://gerrit.wikimedia.org/r/633224 (owner: Herron)
[15:57:14] SRE, netops, Patch-For-Review: Rename cloud-hosts1-b-eqiad to cloud-hosts1-eqiad - https://phabricator.wikimedia.org/T277771 (ayounsi) Open→Resolved All done!
[15:57:34] (CR) Filippo Giunchedi: [C: -1] "LGTM overall" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/671283 (owner: Herron)
[15:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3314 (re)pooling @ 100%: Slowly repool db1144:3314', diff saved to https://phabricator.wikimedia.org/P14990 and previous config saved to /var/cache/conftool/dbconfig/20210322-155808-root.json
[15:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:56] SRE, netops, Patch-For-Review: Rename cloud-hosts1-b-eqiad to cloud-hosts1-eqiad - https://phabricator.wikimedia.org/T277771 (aborrero) thanks!
[15:59:16] (CR) Urbanecm: [C: +1] "per T277723#6933940, also code LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/673699 (https://phabricator.wikimedia.org/T277723) (owner: Luke081515)
[16:01:09] RECOVERY - Host ganeti2015 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms
[16:02:41] SRE, ops-codfw: ganeti2015 doesn't boot - https://phabricator.wikimedia.org/T277537 (Papaul) Open→Resolved Remove and insert back both Risers system is backup up
[16:05:53] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms
[16:07:24] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe) Given it has created some doubts, let me clarify: I've created a first version of the charts that implements solution 1 (and not a complete version of it, either). I did...
[16:07:35] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[16:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:51] SRE, serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe) a:Joe
[16:08:02] SRE, ops-codfw: ganeti2015 doesn't boot - https://phabricator.wikimedia.org/T277537 (MoritzMuehlenhoff) Thanks, Papaul
[16:12:10] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:45] (CR) Bstorm: [C: +1] "Looks good on my end" [puppet] - https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: David Caro)
[16:21:49] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[16:24:07] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[16:31:44] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (Papaul)
[16:32:15] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (Papaul)
[16:37:33] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:37:38] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:03] SRE, SRE-Access-Requests, Patch-For-Review: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (Volans) While the ownership of mailmain within SRE is being discussed, I've raised this request in today's SRE meeting and got approved so that @Ladsgroup i...
[16:38:30] (CR) Volans: "Approved on SRE meeting" [puppet] - https://gerrit.wikimedia.org/r/673971 (https://phabricator.wikimedia.org/T278078) (owner: Volans)
[16:38:55] (CR) JMeybohm: [C: +2] eventrouter: Fix type conversion issues [deployment-charts] - https://gerrit.wikimedia.org/r/674028 (owner: JMeybohm)
[16:39:14] (PS4) JMeybohm: eventrouter: Fix type conversion issues [deployment-charts] - https://gerrit.wikimedia.org/r/674028
[16:40:09] (PS2) David Caro: wmcs.ceph.codfw: Upgrade to latest 5.X kernel [puppet] - https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565)
[16:40:22] (CR) David Caro: "Just added a comment" [puppet] - https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: David Caro)
[16:41:52] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (Papaul)
[16:46:20] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:06] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:48] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:13] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:53:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:55:53] (PS2) H.krishna123: Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162)
[17:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T1700).
[17:01:38] (CR) Ahmon Dancy: [C: +1] gerrit: remove Apache MaxClients limit [puppet] - https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: Hashar)
[17:04:09] SRE, SRE-Access-Requests, Patch-For-Review: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (Dzahn) meanwhile there is a VM, lists1002, that this should be applied to. So far it has the "insetup" role but the _actual_ step to give Amir access after...
[17:04:27] (03CR) 10H.krishna123: "I think this is done, please review when possible :) Thank you" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [17:05:37] (03CR) 10Dzahn: [C: 03+1] "approved in SRE meeting and so far it has no effect, what will actually give Amir access is when the new mailman3 role is applied to lists" [puppet] - 10https://gerrit.wikimedia.org/r/673971 (https://phabricator.wikimedia.org/T278078) (owner: 10Volans) [17:05:47] (03CR) 10David Caro: [C: 03+2] wmcs.ceph.codfw: Upgrade to latest 5.X kernel [puppet] - 10https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro) [17:06:32] (03CR) 10Volans: [C: 03+2] admin: add ladsgroup to mailman3-roots [puppet] - 10https://gerrit.wikimedia.org/r/673971 (https://phabricator.wikimedia.org/T278078) (owner: 10Volans) [17:10:19] RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:11:52] (03PS1) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [17:12:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [17:15:07] 10SRE, 10SRE-Access-Requests: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (10Ladsgroup) Thank you so much for approving. I won't let you down. 
[17:15:15] (03PS8) 10Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [17:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:21:08] (03PS2) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [17:21:19] 10SRE, 10SRE-Access-Requests: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (10Legoktm) >>! In T278078#6935455, @Dzahn wrote: > meanwhile there is a VM, lists1002, that this should be applied to. So far it has the "insetup" role but the _actual_ step to giv... [17:22:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:22:15] (03CR) 10jerkins-bot: [V: 04-1] wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [17:22:41] (03CR) 10Arturo Borrero Gonzalez: "thanks for the patch!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [17:23:46] (03PS3) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [17:28:51] (03CR) 10Jcrespo: "If you don't mind some criticism, some of your changes look a bit too verbose. 
While clarity over brevity is usually preferred, something " [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [17:30:15] !log reindexing Italian wikis on elastic@eqiad, elastic@codfw, and cloudelastic (T274200) [17:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:24] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200 [17:35:18] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [17:36:31] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/674095 [17:39:26] (03PS4) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [17:40:14] (03CR) 10Majavah: "just a reminder for myself: hiera values for deployment-etcd02 need to be moved from Iecfc26a941dbe7741e48bce4f0e584af9090cd07 to this pat" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [17:43:19] (03PS5) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [17:44:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [17:45:17] PROBLEM - AQS root url on aqs1011 is CRITICAL: connect to address 10.64.16.201 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [17:45:39] hnowlan: --^ I think downtime expired [17:46:13] elukey: sounds like it, will ack [17:48:21] ACKNOWLEDGEMENT - AQS root url on aqs1011 is CRITICAL: 
connect to address 10.64.16.201 and port 7232: Connection refused Hnowlan Host not in use yet. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [17:50:41] (03CR) 10Bstorm: "Question mark in inline comment intended, do with it as you wish. It's style nonsense." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [17:52:00] (03CR) 10David Caro: [C: 03+1] "LGTM, @andrew if we work around this with any future patch to cloud init, add a comment to remove 😊" [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [17:54:51] (03PS1) 10Razzi: refinery: Rename --labsdb flag to be --clouddb [puppet] - 10https://gerrit.wikimedia.org/r/674097 (https://phabricator.wikimedia.org/T269211) [17:58:04] (03CR) 10Hashar: [C: 03+1] "Looks fine yes. That is used on contint2001, for example:" [puppet] - 10https://gerrit.wikimedia.org/r/670990 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:59:02] (03CR) 10H.krishna123: "> Patch Set 2:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T1800). Please do the needful. [18:00:04] DannyS712: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:16] here [18:02:23] is anyone available to deploy? 
[18:02:48] i can deploy today [18:03:30] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) @fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue? [18:04:19] thanks Urbanecm. Patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/674047 [18:04:25] DannyS712: since when do we have delete-redirect right? Just curious [18:04:40] (03CR) 10Urbanecm: [C: 03+2] Grant enwiki pagemovers the `delete-redirect` right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674047 (https://phabricator.wikimedia.org/T278131) (owner: 10DannyS712) [18:04:40] since I created it :) [18:04:43] hehe [18:05:19] see https://phabricator.wikimedia.org/T239277 [18:05:34] (03Merged) 10jenkins-bot: Grant enwiki pagemovers the `delete-redirect` right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674047 (https://phabricator.wikimedia.org/T278131) (owner: 10DannyS712) [18:05:44] (03PS6) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [18:06:29] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) [18:07:42] Urbanecm standing by to test [18:08:41] (03PS7) 10Andrew Bogott: wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) [18:09:13] DannyS712: please do [18:09:15] mwdebug1001 [18:10:54] okay, with that debug host specified the move form correctly determines that I have the rights and asks me if I want to delete the target redirect, but when I hit submit it just refreshes - I assume thats due to the debug part and not the underlying code, so should be good to merge 
[18:11:24] will try to submit and go through fully once its live [18:11:54] okay, syncing [18:13:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 951601f7a4c887f21e209b32dbd1cfd3da084816: Grant enwiki pagemovers the delete-redirect right (T278131) (duration: 00m 59s) [18:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:54] T278131: Grant enwiki pagemovers the `delete-redirect` right - https://phabricator.wikimedia.org/T278131 [18:14:16] Urbanecm confirmed to work [18:14:20] cool [18:14:23] anything else? [18:14:41] > I assume thats due to the debug part and not the underlying code, so should be good to merge [18:14:54] X-Wikimedia-Debug should apply to all requests [18:15:00] including POST requests... [18:15:21] legoktm well, it wasn't going through... [18:15:21] Urbanecm nothing here, though a request for you in -stewards if you have a second [18:15:28] (03CR) 10Urbanecm: hrwiki: Configure mentorship for Growth team features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673807 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [18:15:32] (03PS2) 10Urbanecm: hrwiki: Configure mentorship for Growth team features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673807 (https://phabricator.wikimedia.org/T275684) [18:15:51] (03CR) 10Andrew Bogott: [C: 03+2] wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl [puppet] - 10https://gerrit.wikimedia.org/r/674091 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [18:15:57] sounds like a different bug then [18:16:32] legoktm: deletions happen in jobs...sometimes [18:17:04] but permissions should be checked in the same request, not in the job queue? [18:17:24] legoktm: yes, and Danny said "okay, with that debug host specified the move form correctly determines that I have the rights..." [18:17:42] but then why didn't the form submission work? 
[18:17:52] not sure [18:18:06] maybe it did work, but it submitted a job, which checked permissions again? [18:18:10] pure speculation [18:18:17] (03CR) 10Urbanecm: [C: 03+2] hrwiki: Configure mentorship for Growth team features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673807 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [18:18:29] I think it merits more investigation :) [18:18:43] I would be skeptical if the code did that, but I haven't looked at the code! [18:18:44] possibly [18:19:04] (03Merged) 10jenkins-bot: hrwiki: Configure mentorship for Growth team features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673807 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [18:19:09] but at least, it works in real prod :) [18:20:12] (03CR) 10Andrew Bogott: [C: 03+2] Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [18:20:57] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 25247c9cbba3d3741908164f2d15fb8497ce8b5e: hrwiki: Configure mentorship for Growth team features (T275684) (duration: 01m 00s) [18:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:05] T275684: Deploy Growth features on Croatian Wikipedia - https://phabricator.wikimedia.org/T275684 [18:21:12] * Urbanecm done [18:23:27] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10CGlenn) Thank you so much, @Volans ! 
[18:24:35] (03PS5) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [18:25:01] (03CR) 10Volans: "replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [18:43:42] (03CR) 10CRusnov: [C: 03+1] "LGTM. Very needed class thank you." [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [18:48:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:39] (03CR) 10Razzi: "Goes with https://gerrit.wikimedia.org/r/c/analytics/refinery/+/666209" [puppet] - 10https://gerrit.wikimedia.org/r/674097 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [18:52:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:54:07] (03PS1) 10Andrew Bogott: cloud-vps instances: only change a cloud-init template if we have cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674103 [18:54:57] (03CR) 10Dzahn: "tried to compile it to see if we can just merge it to get the access request completed but compiler does not know host list1002 yet. 
next " [puppet] - 10https://gerrit.wikimedia.org/r/673636 (owner: 10Legoktm) [18:55:44] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps instances: only change a cloud-init template if we have cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674103 (owner: 10Andrew Bogott) [18:56:00] mutante: it's not ready to be merged yet, there's still a few outstanding things remaining [18:56:26] https://phabricator.wikimedia.org/T277286 the key ones are setting up ferm and protecting /admin [18:57:02] legoktm: ok, ACK. separately we need https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs/Documentation#How_to_update_the_compiler%27s_facts%3F_%28e.g._INFO%3A_Unable_to_find_facts_for_host_conf2001.codfw.wmnet%2C_skipping%29 [18:58:16] legoktm: I never expected it to be a single change from "nothing" to "production ready", expected many incremental steps, fwiw [18:59:32] my main worry is that just enabling the role will set up exim, mailman3 etc. [18:59:55] at least I think we need ferm setup first, which is on my list for tomorrow [19:00:10] (03CR) 10Bstorm: maintain-dbusers: polish things up a bit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [19:00:11] logging, monitoring dkim, etc. 
we can iterate on [19:00:40] *nod* makes sense [19:00:53] (03PS3) 10Herron: logstash: add field checks to filter throttle [puppet] - 10https://gerrit.wikimedia.org/r/633224 [19:04:23] (03PS6) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) [19:04:51] (03CR) 10Ayounsi: Add Capirca definitions exporter (039 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [19:04:55] (03PS2) 10Bstorm: maintain-dbusers: polish things up a bit [puppet] - 10https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) [19:05:33] (03CR) 10Bstorm: maintain-dbusers: polish things up a bit (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [19:05:47] (03CR) 10Herron: [C: 03+2] logstash: add field checks to filter throttle [puppet] - 10https://gerrit.wikimedia.org/r/633224 (owner: 10Herron) [19:09:49] (03PS1) 10Andrew Bogott: hosts.debian.tmpl: further attempt to handle VMs without cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674126 (https://phabricator.wikimedia.org/T277866) [19:10:54] (03CR) 10jerkins-bot: [V: 04-1] hosts.debian.tmpl: further attempt to handle VMs without cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674126 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [19:11:36] (03PS2) 10Andrew Bogott: hosts.debian.tmpl: further attempt to handle VMs without cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674126 (https://phabricator.wikimedia.org/T277866) [19:12:57] (03CR) 10Andrew Bogott: [C: 03+2] hosts.debian.tmpl: further attempt to handle VMs without cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/674126 (https://phabricator.wikimedia.org/T277866) (owner: 10Andrew Bogott) [19:16:23] (03CR) 10Bstorm: maintain-dbusers: polish things up a bit 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [19:23:38] (03CR) 10Krinkle: "The diff seems non-trivially different. E.g. not just a dc swap for some routes. I was expecting this to e.g. map eqiad->eqiad instead of " [puppet] - 10https://gerrit.wikimedia.org/r/654330 (owner: 10Aaron Schulz) [19:28:11] (03CR) 10Ayounsi: [C: 03+1] "+1 despite the issues you mentions, it's still quite useful." [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [19:29:32] (03CR) 10Dzahn: [WIP] Add lists-next.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/673638 (owner: 10Legoktm) [19:31:37] (03CR) 10Dzahn: [C: 03+2] gerrit: remove Apache MaxClients limit [puppet] - 10https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [19:45:11] (03CR) 10Dzahn: "I briefly talked about this in #httpd channel and since we are using the event MPM, we should be using MaxRequestWorkers in the first plac" [puppet] - 10https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [19:47:30] !log gerrit - restarting apache2 after we dropped MaxClients config line. This should make us fall back to Debian default MaxRequestWorkers. 
(since we use event MPM we should not be using MaxClients in the first place, says #httpd) (T277127) [19:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:38] T277127: Gerrit Apache out of workers - https://phabricator.wikimedia.org/T277127 [19:48:33] (03CR) 10Dzahn: "manually restarted apache2" [puppet] - 10https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [19:50:45] !log gerrit2001 - restarted apache2 as well for consistency [19:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:19] (03PS2) 10Dzahn: site/conftool-data: turn mw2278,mw2279 into canary jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/673630 (https://phabricator.wikimedia.org/T277780) [19:57:41] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul fyi, this one is separate from T277119. I had to somehow separate them and instead by rack this is by purchase date. You will see though that... [19:58:40] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: turn mw2278,mw2279 into canary jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/673630 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [20:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T2000). 
[20:02:07] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2278.codfw.wmnet,service=canary [20:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:16] !log dzahn@cumin1001 conftool action : set/weight=1; selector: name=mw2279.codfw.wmnet,service=canary [20:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2278.codfw.wmnet,service=canary [20:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2279.codfw.wmnet,service=canary [20:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:40] (03CR) 10Dzahn: "[cumin1001:~] $ sudo -i confctl select name=mw2279.codfw.wmnet,service=canary set/weight=1" [puppet] - 10https://gerrit.wikimedia.org/r/673630 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [20:07:51] (03PS1) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 [20:16:24] (03PS1) 10Addshore: node10-sssd: bump npm from 6.5 to 6.14.5 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/674134 (https://phabricator.wikimedia.org/T278180) [20:21:13] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:23:41] (03PS1) 10Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into review/dzahn/decom-appserver-codfw [puppet] - 10https://gerrit.wikimedia.org/r/674136 [20:23:43] (03PS1) 10Dzahn: site/conftool-data: decom mw2249,mw2250 jobrunner canaries 
[puppet] - 10https://gerrit.wikimedia.org/r/674137 (https://phabricator.wikimedia.org/T277780) [20:23:45] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:24:17] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into review/dzahn/decom-appserver-codfw [puppet] - 10https://gerrit.wikimedia.org/r/674136 (owner: 10Dzahn) [20:24:25] (03Abandoned) 10Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into review/dzahn/decom-appserver-codfw [puppet] - 10https://gerrit.wikimedia.org/r/674136 (owner: 10Dzahn) [20:24:38] (03PS2) 10Dzahn: site/conftool-data: decom mw2249,mw2250 jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/674137 (https://phabricator.wikimedia.org/T277780) [20:25:55] (03PS3) 10Dzahn: site/conftool-data: decom mw2249,mw2250 jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/674137 (https://phabricator.wikimedia.org/T277780) [20:29:54] (03PS5) 10Legoktm: [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [20:31:21] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [20:38:23] (03PS6) 10Legoktm: Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [20:43:30] 10ops-eqsin, 10DC-Ops: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [20:43:58] 10ops-eqsin, 10DC-Ops: (Need By: TBD) rack/setup/install cp501[3-6] - https://phabricator.wikimedia.org/T278182 (10RobH) [20:45:49] (03PS7) 10Legoktm: Add shellbox chart [deployment-charts] - 
10https://gerrit.wikimedia.org/r/667047 [20:47:00] (03CR) 10Volans: "Thanks for the fixes, replies inline" (036 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [20:50:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:54:37] (03CR) 10Legoktm: Add shellbox chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [20:55:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:00:04] Reedy and sbassett: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T2100). [21:00:58] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock CI in other CRs, already tested in other repositories like Cumin." [software/spicerack] - 10https://gerrit.wikimedia.org/r/673961 (owner: 10Volans) [21:02:04] Hey all - was going to deploy one sec patch for T272244 in a bit... [21:07:15] (03CR) 10jerkins-bot: [V: 04-1] tests: fix pip backtracking [software/spicerack] - 10https://gerrit.wikimedia.org/r/673961 (owner: 10Volans) [21:08:00] !log Deployed security patch for T272244 [21:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:52] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Papaul) @Dzahn thanks for the update. 
I am planning on racking mw2401 to mw2411 in A5 and not in A4 since A4 is a 10G rack , i will like to keep this rack on... [21:14:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:16:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:18:13] (03CR) 10Ahmon Dancy: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [21:27:43] (03CR) 10CDanis: [C: 03+1] "I don't feel completely qualified to review this, but, looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [21:28:19] (03CR) 10CDanis: [C: 03+1] P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [21:30:27] (03PS6) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [21:30:31] (03PS3) 10Volans: tests: fix pip backtracking [software/spicerack] - 10https://gerrit.wikimedia.org/r/673961 [21:30:33] (03PS1) 10Volans: tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 [21:33:02] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: polish things up a bit [puppet] - 10https://gerrit.wikimedia.org/r/673606 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:36:39] (03PS4) 10Volans: tests: fix pip backtracking [software/spicerack] - 10https://gerrit.wikimedia.org/r/673961 [21:36:41] (03PS2) 10Volans: tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 
[21:36:43] (03PS7) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [21:46:17] (03CR) 10Volans: [C: 03+2] "Self merging to unblock CI on other CRs." [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [21:46:37] (03CR) 10Volans: [C: 03+2] "Happy to fix any post-merge comment." [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [21:47:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:47:27] 10SRE, 10SRE-tools: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10Legoktm) [21:49:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:50:05] (03PS1) 10Alexandros Kosiaris: downtime: Support services and other special icinga host [puppet] - 10https://gerrit.wikimedia.org/r/674147 (https://phabricator.wikimedia.org/T277191) [21:50:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes staging-eqiad: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/671175 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [21:50:34] (03PS2) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 [21:51:06] (03CR) 10jerkins-bot: [V: 04-1] netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [21:51:41] (03CR) 10jerkins-bot: [V: 04-1] tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [21:52:22] (03CR) 10Volans: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [21:54:38] 10Puppet, 10SRE, 10SRE-tools: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10Volans) p:05Triage→03Medium [21:55:27] (03CR) 10Volans: [C: 03+2] "Self merging to unblock CI on other CRs. Happy to fix any post-merge comment." 
[software/homer] - 10https://gerrit.wikimedia.org/r/673990 (owner: 10Volans) [21:58:02] (03CR) 10jerkins-bot: [V: 04-1] tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [21:58:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [21:58:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice, can't wait for the blockers to be lifted so we can merge this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/673956 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [21:59:32] (03Merged) 10jenkins-bot: tests: fix pip backtracking [software/homer] - 10https://gerrit.wikimedia.org/r/673990 (owner: 10Volans) [22:00:10] (03CR) 10Volans: [C: 03+2] tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [22:00:53] (03PS2) 10Alexandros Kosiaris: kubernetes eqiad: Apply role and hiera values to new masters [puppet] - 10https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [22:03:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes eqiad: Apply role and hiera values to new masters [puppet] - 10https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [22:03:55] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: Enable eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/673955 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [22:05:27] (03CR) 10jerkins-bot: [V: 04-1] tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [22:06:21] (03PS1) 10Ebernhardson: Turn on glent m1 AB test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674115 (https://phabricator.wikimedia.org/T262612) [22:08:01] (03PS2) 
10Alexandros Kosiaris: kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - 10https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [22:08:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - 10https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [22:13:27] (03CR) 10Volans: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [22:17:46] (03PS11) 10Volans: Add Capirca support to Homer [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [22:18:06] (03PS2) 10Volans: sre.hosts.downtime: fix example usage [cookbooks] - 10https://gerrit.wikimedia.org/r/672368 [22:19:11] (03Merged) 10jenkins-bot: tests: fix format checking [software/spicerack] - 10https://gerrit.wikimedia.org/r/674145 (owner: 10Volans) [22:20:33] (03Abandoned) 10Volans: DO NOT MERGE - debugging git clone in CI [dns] - 10https://gerrit.wikimedia.org/r/668345 (owner: 10Volans) [22:21:18] (03CR) 10jerkins-bot: [V: 04-1] Add Capirca support to Homer [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [22:21:58] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: fix example usage [cookbooks] - 10https://gerrit.wikimedia.org/r/672368 (owner: 10Volans) [22:22:56] (03CR) 10CDanis: [C: 03+1] "lgtm! 
a little involved but not so bad" [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [22:24:28] (03Merged) 10jenkins-bot: sre.hosts.downtime: fix example usage [cookbooks] - 10https://gerrit.wikimedia.org/r/672368 (owner: 10Volans) [22:37:31] (03PS1) 10Bstorm: maintain-dbusers: rely on the global_id, not username for paws [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) [22:37:58] (03CR) 10Dzahn: "> For the record, what is the value of MaxRequestWorkers now?" [puppet] - 10https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [22:38:39] (03CR) 10Dzahn: "for comparison, Phabricator has a puppetized MaxRequestWorkers 400" [puppet] - 10https://gerrit.wikimedia.org/r/674070 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [22:40:42] (03CR) 10Bstorm: maintain-dbusers: rely on the global_id, not username for paws (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [22:44:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2249.codfw.wmnet [22:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:56] (03PS2) 10Bstorm: maintain-dbusers: rely on the UIDS, not username [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) [22:48:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:21] !log decom mw2249 [22:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) [22:54:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: 
(Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) All 35 servers racked. [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I ♥ Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210322T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:35] i can ship it [23:00:52] (03CR) 10Ebernhardson: [C: 03+2] Turn on glent m1 AB test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674115 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:01:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2249.codfw.wmnet [23:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:48] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2249.codfw.wmnet` - mw2249.codfw.wmnet (**PASS**) - Downtime... [23:02:42] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul We can do that, it isn't a problem. We can use A3 and A5.
Thank you [23:07:48] (03Merged) 10jenkins-bot: Turn on glent m1 AB test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674115 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:08:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2250.codfw.wmnet [23:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:18] (03PS1) 10RobH: cloudgw100[12] setup [puppet] - 10https://gerrit.wikimedia.org/r/674162 (https://phabricator.wikimedia.org/T272403) [23:17:11] (03CR) 10RobH: [C: 03+2] cloudgw100[12] setup [puppet] - 10https://gerrit.wikimedia.org/r/674162 (https://phabricator.wikimedia.org/T272403) (owner: 10RobH) [23:18:38] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [23:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:56] !log ebernhardson@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: T262612: Start glent m1 ab test (duration: 01m 53s) [23:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:03] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [23:20:22] (03CR) 10Cwhite: [C: 03+1] "overall LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [23:21:45] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) [23:34:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2250.codfw.wmnet [23:34:26] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:26] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2250.codfw.wmnet` - mw2250.codfw.wmnet (**PASS**) - Downtime... [23:34:30] (03CR) 10Bstorm: [C: 04-1] "So this is a nice idea. This is the way the system is supposed to work, but the way it was set up and the way some things ran at some poin" [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [23:35:37] (03PS4) 10Dzahn: site/conftool-data: decom mw2249,mw2250 jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/674137 (https://phabricator.wikimedia.org/T277780) [23:36:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudgw1002.eqiad.wmnet `... 
[23:37:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) [23:42:55] (03PS3) 10Tim Starling: Use the RequestTimeout library to set time limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326) [23:43:58] (03PS3) 10Bstorm: maintain-dbusers: rely on the UIDS, not username [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) [23:46:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) [23:47:43] (03PS1) 10Bstorm: maintain-dbusers: rely on the global_id, not username for paws [puppet] - 10https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) [23:49:02] (03PS2) 10Razzi: refinery: Rename --labsdb flag to be --clouddb [puppet] - 10https://gerrit.wikimedia.org/r/674097 (https://phabricator.wikimedia.org/T269211) [23:49:03] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE [23:49:08] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: decom mw2249,mw2250 jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/674137 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [23:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:37] (03CR) 10Bstorm: [C: 04-1] "Created https://gerrit.wikimedia.org/r/c/operations/puppet/+/674165 out of patch set 1 so that we can prevent the crashes from PAWS accoun" [puppet] - 10https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [23:51:15] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) [23:51:42] PROBLEM - Check systemd state on 
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:52:24] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1002.eqiad.wmnet with reason: REIMAGE [23:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:32] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (10Papaul) [23:58:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudgw1002.eqiad.wmnet'] ` and were **ALL** successful. [23:58:49] I am deploying this RequestTimeout change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/672579 [23:59:07] Sounds good. [23:59:09] (03CR) 10Tim Starling: [C: 03+2] Use the RequestTimeout library to set time limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326) (owner: 10Tim Starling)