[00:05:27] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.9125 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[01:47:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:52:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:04:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:06:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:54:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:56:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:56:07] SRE, ops-eqiad, DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (Marostegui) >>!
In T275309#6879207, @Cmjohnson wrote: > This has been moved to this coming Friday at 10am local time (1500UTC) Was this done past Friday in the end? Thanks
[06:03:54] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Sergey.Trofimovsky.SF) >>! In T276144#6887981, @Dzahn wrote: > @Sergey.Trofimovsky.SF Do yo...
[06:13:19] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:21:06] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[06:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 T276742', diff saved to https://phabricator.wikimedia.org/P14649 and previous config saved to /var/cache/conftool/dbconfig/20210308-062350-marostegui.json
[06:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:57] T276742: Check all tables on some hosts - https://phabricator.wikimedia.org/T276742
[06:25:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:27:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:29:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14650 and previous config saved to /var/cache/conftool/dbconfig/20210308-062932-root.json
[06:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:37] (CR) Marostegui: [C: +2] parsercache.my.cnf: innodb_change_buffering = none [puppet] - https://gerrit.wikimedia.org/r/668669 (https://phabricator.wikimedia.org/T263443) (owner: Marostegui)
[06:35:29] (PS1) Marostegui: db2116: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/669622 (https://phabricator.wikimedia.org/T275633)
[06:36:04] (CR) Marostegui: [C: +2] db2116: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/669622 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[06:37:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 T276742', diff saved to https://phabricator.wikimedia.org/P14651 and previous config saved to /var/cache/conftool/dbconfig/20210308-063700-marostegui.json
[06:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:07] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742
[06:41:39] SRE, ops-eqiad, User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (ArielGlenn) Open→Resolved Yes, things look good from here. Thanks a lot!
[06:42:35] SRE, Datasets-General-or-Unknown, Dumps-Generation, netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (ArielGlenn) Open→Resolved Never did merge the task but I'm closing it now. Dumpsdata1001 traffic looks good and so does 1003.
[06:44:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14652 and previous config saved to /var/cache/conftool/dbconfig/20210308-064436-root.json
[06:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:43] !log Set innodb_change_buffering = none on all parsercache hosts T263443
[06:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:49] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443
[06:49:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2146 T275633', diff saved to https://phabricator.wikimedia.org/P14653 and previous config saved to /var/cache/conftool/dbconfig/20210308-064953-marostegui.json
[06:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:01] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[06:50:25] (PS1) Marostegui: db2146: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/669626 (https://phabricator.wikimedia.org/T275633)
[06:51:19] (CR) Marostegui: [C: +2] db2146: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/669626 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[06:52:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2092 T275633', diff saved to https://phabricator.wikimedia.org/P14654 and previous config saved to /var/cache/conftool/dbconfig/20210308-065220-marostegui.json
[06:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:52] ops-eqiad, DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (ayounsi) p:Triage→Low
[06:53:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2116 T275633', diff saved to https://phabricator.wikimedia.org/P14655 and
previous config saved to /var/cache/conftool/dbconfig/20210308-065300-marostegui.json
[06:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14656 and previous config saved to /var/cache/conftool/dbconfig/20210308-065939-root.json
[06:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:26] (PS1) Marostegui: dbproxy1018: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211)
[07:13:55] (CR) jerkins-bot: [V: -1] dbproxy1018: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[07:14:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14657 and previous config saved to /var/cache/conftool/dbconfig/20210308-071443-root.json
[07:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:48] (PS2) Marostegui: dbproxy1018: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211)
[07:17:07] (CR) Marostegui: [C: +2] dbproxy1018: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[07:20:51] (PS1) Marostegui: Revert "dbproxy1018: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/668823
[07:21:27] (CR) Marostegui: [C: +2] Revert "dbproxy1018: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/668823 (owner: Marostegui)
[07:23:14] !log drain + reimage analytics107[4,5] to Buster
[07:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:22] (PS1) Marostegui: dbproxy1019: Depool
clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669633 (https://phabricator.wikimedia.org/T269211)
[07:23:49] (CR) jerkins-bot: [V: -1] dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669633 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[07:26:14] (PS2) Marostegui: dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669633 (https://phabricator.wikimedia.org/T269211)
[07:28:13] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:20] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/669633 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[07:32:24] !log Depool clouddb1013:3311, clouddb1013:3313 - T269211
[07:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:30] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[07:37:19] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[07:39:22] ^ expected
[07:40:13] noted!
[07:41:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 15 down 2 Marostegui Known https://wikitech.wikimedia.org/wiki/HAProxy
[07:44:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:44:48] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1074.eqiad.wmnet with reason: REIMAGE
[07:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1074.eqiad.wmnet with reason: REIMAGE
[07:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1075.eqiad.wmnet with reason: REIMAGE
[07:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1075.eqiad.wmnet with reason: REIMAGE
[07:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:31] (PS1) DCausse: Add a note for the elasticsearch image in releng/dev-images [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/669720
[08:12:41] (PS1) Muehlenhoff: Remove access for dedcode [puppet] - https://gerrit.wikimedia.org/r/669726
[08:15:17] (CR) Muehlenhoff: [C: +2] Remove access for dedcode [puppet] - https://gerrit.wikimedia.org/r/669726 (owner: Muehlenhoff)
[08:20:30] !log drain + reimage an-worker108[1,2] to Buster
[08:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:41] (PS1) Kosta Harlan: MentorHooks: Make mentor assignment follow same rules as HomepageHooks [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) -
https://gerrit.wikimedia.org/r/668824 (https://phabricator.wikimedia.org/T276720)
[08:21:39] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435
[08:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:45] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[08:32:03] (CR) Filippo Giunchedi: [C: +1] logstash: ingest logstash logs as json and convert to ECS [puppet] - https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) (owner: Cwhite)
[08:48:15] SRE: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:48:17] SRE: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (MoritzMuehlenhoff) I'll take care of this once the new server is racked.
[08:49:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1081.eqiad.wmnet with reason: REIMAGE
[08:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:47] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1082.eqiad.wmnet with reason: REIMAGE
[08:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1081.eqiad.wmnet with reason: REIMAGE
[08:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1082.eqiad.wmnet with reason: REIMAGE
[08:56:16] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:17:22] <_joe_> !log regenerating puppet certs for scb200{1,2}
[09:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:19:14] !log drain + reimage an-worker108[3,4] to Buster
[09:19:16] (PS1) JMeybohm: Remove SSH keys reused in Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/669741 (https://phabricator.wikimedia.org/T275677)
[09:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:45] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[09:26:13] (PS2) JMeybohm: Remove SSH keys reused in Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/669741 (https://phabricator.wikimedia.org/T275677)
[09:26:49] (PS6) Kormat: mariadb: Convert pt-heartbeat to a systemd service. [puppet] - https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528)
[09:27:05] (PS4) Kormat: mariadb: Use section parameters: misc profiles. [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497)
[09:27:33] (CR) David Caro: "Looks ok to me, but someone else should check the business logic, any `nit` can be ignored."
(4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[09:34:01] <_joe_> !log manually removed the old graphoid IP from scb server's interfaces (long-standing bug in wikimedia-lvs-realserver when removing the last managed IP)
[09:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:34] SRE, observability: Convert udp2log init script to use systemd - https://phabricator.wikimedia.org/T276623 (JMeybohm)
[09:36:16] SRE, Packaging: Disable man-db in pbuilder in package_builder on deneb - https://phabricator.wikimedia.org/T276632 (JMeybohm)
[09:36:27] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/669741 (https://phabricator.wikimedia.org/T275677) (owner: JMeybohm)
[09:36:41] (CR) JMeybohm: [C: +2] Remove SSH keys reused in Cloud VPS [puppet] - https://gerrit.wikimedia.org/r/669741 (https://phabricator.wikimedia.org/T275677) (owner: JMeybohm)
[09:38:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:38:53] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:22] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/668825
[09:45:02] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (JMeybohm) Resolved→Open @OlyKalinichenkoSpeedAndFunction I did remove your SSH key from your production account as y...
[09:45:08] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (JMeybohm) Resolved→Open @Eugene.chernov I did remove your SSH key from your production account as you seem to have up...
[09:47:32] (PS1) Giuseppe Lavagetto: systemd::timer::job: correctly quote environment variables [puppet] - https://gerrit.wikimedia.org/r/669753
[09:47:47] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:49:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:55:11] <_joe_> !log uploading new versions of docker images: php7.{2,3}-{cli,fpm}, httpd, httpd-fcgi, mediawiki-httpd, memcached T276097 T265327
[09:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:19] T276097: Create MediaWiki httpd base image - https://phabricator.wikimedia.org/T276097
[09:55:20] T265327: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327
[09:57:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:58:51] (CR) Kormat: "Andrew: can you handle the clouddb grants?"
[puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: Dzahn)
[09:59:25] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/668825 (owner: Marostegui)
[10:00:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:01:44] !log Repool clouddb1013:3311, clouddb1013:3313
[10:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:02:39] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (RhinosF1) All 3 contractors have done so despite being told. The other one just got caught earlier. Should they be asked to...
[10:03:35] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (Eugene.chernov) hello @JMeybohm , Thank you. Here is the key for prod: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDXSV4ht0Q6zuWnnN...
[10:05:37] (CR) Kormat: "Deployed to m5."
[puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: Dzahn)
[10:08:04] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28411/console" [puppet] - https://gerrit.wikimedia.org/r/669753 (owner: Giuseppe Lavagetto)
[10:11:21] (CR) Elukey: [C: +1] systemd::timer::job: correctly quote environment variables [puppet] - https://gerrit.wikimedia.org/r/669753 (owner: Giuseppe Lavagetto)
[10:12:31] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28412/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat)
[10:15:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1083.eqiad.wmnet with reason: REIMAGE
[10:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1084.eqiad.wmnet with reason: REIMAGE
[10:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:17:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1083.eqiad.wmnet with reason: REIMAGE
[10:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1084.eqiad.wmnet with reason: REIMAGE
[10:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:38] RECOVERY -
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:22:16] (PS5) Kormat: mariadb: Use section parameters: misc profiles. [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497)
[10:23:46] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28413/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat)
[10:29:53] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/669753 (owner: Giuseppe Lavagetto)
[10:34:18] (CR) Muehlenhoff: [C: +1] "Looks good. When we split wmf-laptop into SRE and non-SRE packages, the Cloud bastion will be primary.bastion.wmcloud.org, but for migrati" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/668746 (owner: Elukey)
[10:41:28] !log drain + reimage an-worker1104/1089 to Debian Buster
[10:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:38] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:15] (PS6) Kormat: mariadb: Use section parameters: misc profiles.
[puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497)
[10:49:28] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (JMeybohm)
[10:50:05] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28414/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat)
[10:50:27] (PS1) JMeybohm: Add new SSH key for eugene-chernov [puppet] - https://gerrit.wikimedia.org/r/669768 (https://phabricator.wikimedia.org/T275679)
[10:50:46] (PS7) Kormat: mariadb: Use section parameters: misc profiles. [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497)
[10:51:33] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28415/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat)
[10:52:07] (PS6) Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - https://gerrit.wikimedia.org/r/668075
[10:55:21] (PS8) Kormat: mariadb: Use section parameters: misc profiles. [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497)
[10:56:15] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28417/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat)
[11:04:07] (CR) Elukey: "Moritz: I just noticed the following in the docs:" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/668746 (owner: Elukey)
[11:05:45] (CR) Giuseppe Lavagetto: [C: -1] "The container should be built and run as www-data.
See what was done for shellbox." [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: Dduvall)
[11:06:12] (CR) Muehlenhoff: [C: +1] "> Patch Set 1:" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/668746 (owner: Elukey)
[11:08:41] (CR) Elukey: "+ Arturo to see if we can add the page :)" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/668746 (owner: Elukey)
[11:11:25] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:14:08] (PS2) Elukey: systemd::timer::job: correctly quote environment variables [puppet] - https://gerrit.wikimedia.org/r/669753 (owner: Giuseppe Lavagetto)
[11:14:47] (CR) Volans: [C: -1] "Thanks for the patch and the tests! One fix needed for the regex, the rest are just minor improvements." (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: Legoktm)
[11:16:21] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28420/console" [puppet] - https://gerrit.wikimedia.org/r/669753 (owner: Giuseppe Lavagetto)
[11:19:13] (CR) Kosta Harlan: "Thank you!"
(1 comment) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/669720 (owner: DCausse)
[11:20:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:30] (CR) Elukey: [V: +1 C: +1] systemd::timer::job: correctly quote environment variables [puppet] - https://gerrit.wikimedia.org/r/669753 (owner: Giuseppe Lavagetto)
[11:21:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:22:33] PROBLEM - Check systemd state on ms-be2060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:01] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28421/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (owner: Klausman)
[11:28:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1088.eqiad.wmnet with reason: REIMAGE
[11:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:15] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2060 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210308T1130). Please do the needful.
[11:30:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1088.eqiad.wmnet with reason: REIMAGE
[11:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:51] RECOVERY - Check systemd state on ms-be2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:45] SRE, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Volans)
[11:40:29] SRE, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Volans)
[11:41:05] (PS2) Volans: sre.hosts.decommission: temporary fix for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689)
[11:43:15] (CR) Jbond: [C: +1] Add new SSH key for eugene-chernov [puppet] - https://gerrit.wikimedia.org/r/669768 (https://phabricator.wikimedia.org/T275679) (owner: JMeybohm)
[11:50:55] (CR) JMeybohm: [C: +2] Add new SSH key for eugene-chernov [puppet] - https://gerrit.wikimedia.org/r/669768 (https://phabricator.wikimedia.org/T275679) (owner: JMeybohm)
[11:53:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (JMeybohm) Open→Resolved Account has been updated
[11:59:38] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210308T1200).
[12:00:04] kostajh: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:54] I can deploy today [12:01:15] kostajh: you around? [12:01:23] (CR) Urbanecm: [C: +2] MentorHooks: Make mentor assignment follow same rules as HomepageHooks [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668824 (https://phabricator.wikimedia.org/T276720) (owner: Kosta Harlan) [12:01:52] Urbanecm: hi! [12:02:14] Hi Kosta. I'll ping you when available. [12:02:19] (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov) [12:07:18] Urbanecm: I'm not sure there's a great way to test this, so I think it could just be synced [12:07:40] kostajh: okay, good to know. [12:08:17] SRE: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (MoritzMuehlenhoff) >>! In T224579#6887456, @Majavah wrote: > Hi, is this going to happen? Beta cluster has also an IRC server running Jessie, and in an effort of getting rid of Jessie on beta I'm offering it as a test...
[12:08:48] (PS1) Giuseppe Lavagetto: mediawiki-httpd: force removal of mod_deflate [docker-images/production-images] - https://gerrit.wikimedia.org/r/669780 [12:12:25] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mediawiki-httpd: force removal of mod_deflate [docker-images/production-images] - https://gerrit.wikimedia.org/r/669780 (owner: Giuseppe Lavagetto) [12:13:04] (CR) Elukey: [WIP, do not review] Add k8s config for ML machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/668075 (owner: Klausman) [12:15:28] (Merged) jenkins-bot: MentorHooks: Make mentor assignment follow same rules as HomepageHooks [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668824 (https://phabricator.wikimedia.org/T276720) (owner: Kosta Harlan) [12:18:18] kostajh: I'm going to sync this and rely on canary checks [12:18:27] Urbanecm: thanks, that sounds good to me [12:20:51] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/includes/Mentorship/MentorHooks.php: 48d6c55c91b42445900ccdf06b78703c1c5233a6: MentorHooks: Make mentor assignment follow same rules as HomepageHooks (T276720) (duration: 00m 58s) [12:20:55] kostajh: done [12:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:57] anything else?
[12:20:58] T276720: User::loadFromSession called before the end of Setup.php - https://phabricator.wikimedia.org/T276720 [12:21:10] Urbanecm: ty [12:21:13] np [12:22:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14660 and previous config saved to /var/cache/conftool/dbconfig/20210308-122201-root.json [12:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:51] (PS1) Phuedx: vector: Expand Desktop Improvements pilot wiki group [mediawiki-config] - https://gerrit.wikimedia.org/r/669783 (https://phabricator.wikimedia.org/T273090) [12:29:15] (PS1) Jbond: cfssl::cert: change how we distribute bundles [puppet] - https://gerrit.wikimedia.org/r/669784 [12:30:16] (CR) jerkins-bot: [V: -1] vector: Expand Desktop Improvements pilot wiki group [mediawiki-config] - https://gerrit.wikimedia.org/r/669783 (https://phabricator.wikimedia.org/T273090) (owner: Phuedx) [12:36:04] (PS2) Phuedx: vector: Expand Desktop Improvements pilot wiki group [mediawiki-config] - https://gerrit.wikimedia.org/r/669783 (https://phabricator.wikimedia.org/T273090) [12:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14661 and previous config saved to /var/cache/conftool/dbconfig/20210308-123704-root.json [12:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:15] (PS12) Giuseppe Lavagetto: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: Dduvall) [12:37:17] (PS1) Giuseppe Lavagetto: pipeline: add building the webserver image [mediawiki-config] - https://gerrit.wikimedia.org/r/669807 [12:38:14] (CR) Jbond: [C: +2] cfssl::cert: change how we distribute bundles [puppet] - 
https://gerrit.wikimedia.org/r/669784 (owner: Jbond) [12:39:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:42:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:49:27] (PS1) Jbond: cfssl::cert: fix path for int ca [puppet] - https://gerrit.wikimedia.org/r/669810 [12:50:24] (CR) Jbond: [C: +2] cfssl::cert: fix path for int ca [puppet] - https://gerrit.wikimedia.org/r/669810 (owner: Jbond) [12:52:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14662 and previous config saved to /var/cache/conftool/dbconfig/20210308-125208-root.json [12:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:27] (PS9) Kormat: mariadb: Use section parameters: misc profiles.
[puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [12:57:22] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28422/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:01:01] (PS1) Jbond: cfssl::cert: change renew time to 24 hours [puppet] - https://gerrit.wikimedia.org/r/669815 [13:02:11] (CR) Jbond: [C: +2] cfssl::cert: change renew time to 24 hours [puppet] - https://gerrit.wikimedia.org/r/669815 (owner: Jbond) [13:03:11] (PS1) Gergő Tisza: [beta] GrowthExperiments: set $wgGEDeveloperSetup = true on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/669816 (https://phabricator.wikimedia.org/T274198) [13:03:31] (PS10) Kormat: mariadb: Use section parameters: misc profiles. [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [13:05:19] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28423/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14663 and previous config saved to /var/cache/conftool/dbconfig/20210308-130712-root.json [13:07:16] (PS1) Jbond: cfssl::signer: change default expiry to 96h [puppet] - https://gerrit.wikimedia.org/r/669818 [13:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:26] (CR) Jbond: [C: +2] cfssl::signer: change default expiry to 96h [puppet] - https://gerrit.wikimedia.org/r/669818 (owner: Jbond) [13:19:19] (PS1) Kormat: mariadb: Set misc nodes in codfw as 'master' [puppet] - https://gerrit.wikimedia.org/r/669821 
(https://phabricator.wikimedia.org/T275497) [13:19:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:21:13] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 3 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28424/console" [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:22:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:32:39] (PS2) Kormat: mariadb: Set misc nodes in codfw as 'master' [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) [13:33:41] (CR) Volans: [C: +2] sre.hosts.decommission: temporary fix for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689) (owner: Volans) [13:34:19] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 3 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28425/console" [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:35:54] (Merged) jenkins-bot: sre.hosts.decommission: temporary fix for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689) (owner: Volans) [13:36:06] (CR) Kormat: [V: +1 C: +2] mariadb: Set misc nodes in codfw as 'master' [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:36:17] (CR) Kormat: [V: +1] mariadb: Set misc nodes in codfw as 'master' [puppet] - 
https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:39:01] (PS11) Kormat: mariadb: Use section parameters: misc profiles. [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [13:41:06] (CR) Volans: "Couple of questions/comments inline. Will this break af-netbox in WMCS?" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov) [13:42:31] (CR) Marostegui: "Will this make them page if they go down?" [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:43:04] SRE, SRE-tools, Patch-For-Review: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (Volans) Open→Resolved a:Volans The additional sleep in the above patch should have worked around the issue. Resolving it for now, feel free...
[13:43:29] (CR) Marostegui: [C: +1] "Sorry, I didn't see you were also changing misc.pp to avoid that :)" [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:43:39] (PS7) Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - https://gerrit.wikimedia.org/r/668075 [13:43:55] (CR) Kormat: [V: +1 C: +2] mariadb: Set misc nodes in codfw as 'master' [puppet] - https://gerrit.wikimedia.org/r/669821 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:44:58] (PS12) Kormat: mariadb: Use section parameters: smaller misc profiles [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [13:45:37] (PS13) Kormat: mariadb: Use section parameters: smaller misc profiles [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [13:46:15] (PS8) Klausman: [WIP, do not review] Add k8s config for ML machines [puppet] - https://gerrit.wikimedia.org/r/668075 [13:46:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:46:39] (CR) Klausman: [WIP, do not review] Add k8s config for ML machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/668075 (owner: Klausman) [13:47:20] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28426/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:47:47] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28427/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (owner: Klausman) [13:48:52] 
RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:50:05] (PS9) Klausman: modules/roles: Add k8s config for ML team machines [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [13:54:05] (PS14) Kormat: mariadb: Use section parameters: smaller misc profiles [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [13:55:46] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28428/console" [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [13:59:26] (CR) Volans: "I'd like to see this applied to af-netbox to be able to test it." (6 comments) [software/netbox] - https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov) [14:00:45] (PS15) Kormat: mariadb: Use section parameters: smaller misc profiles [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [14:05:56] (CR) Muehlenhoff: [C: +1] netbox, profile::netbox: Switch to CAS authentication (1 comment) [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov) [14:14:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1105.eqiad.wmnet with reason: REIMAGE [14:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:04] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28430/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:16:20] 
PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:16:32] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1106.eqiad.wmnet with reason: REIMAGE [14:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1105.eqiad.wmnet with reason: REIMAGE [14:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:10] (CR) Marostegui: [C: +1] mariadb: Use section parameters: smaller misc profiles [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [14:17:46] (CR) Kormat: [C: +2] mariadb: Use section parameters: smaller misc profiles [puppet] - https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [14:18:33] (CR) Elukey: "A couple of things are off in the PCC:" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:18:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1106.eqiad.wmnet with reason: REIMAGE [14:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] SRE, SRE-Access-Requests, Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (JMeybohm) p:Triage→Medium [14:21:40] (CR) Elukey: "in role prometheus.yaml I see" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:25:24] (PS10) Klausman: modules/roles: Add k8s config for ML team machines [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [14:26:14] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28431/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:27:38] (CR) Elukey: "Forgot also to mention:" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:28:27] (CR) Klausman: [V: +1] "> Patch Set 9:" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:29:24] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28432/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:32:48] SRE: Log the real X-Client-IP in apache mediawiki logs - https://phabricator.wikimedia.org/T246348 (akosiaris) p:High→Low >>! 
In T246348#6887555, @jijiki wrote: > After discussing with @akosiaris, we decided that when a request is made from k8s towards the API, it makes sense for apache to see the po... [14:34:50] I'll deploy a beta-only patch [14:35:08] (CR) Gergő Tisza: [C: +2] [beta] GrowthExperiments: set $wgGEDeveloperSetup = true on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/669816 (https://phabricator.wikimedia.org/T274198) (owner: Gergő Tisza) [14:35:59] (Merged) jenkins-bot: [beta] GrowthExperiments: set $wgGEDeveloperSetup = true on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/669816 (https://phabricator.wikimedia.org/T274198) (owner: Gergő Tisza) [14:36:26] (CR) Alexandros Kosiaris: [C: +1] "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:39:04] (CR) Ottomata: "I used to know and thought I had documented it somewhere, but now can't find it... :/" [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:39:17] (CR) Ottomata: [C: +2] eventgate-analytics-external - Bump replicas to 6 for increase in mediawiki.client.session_tick [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:39:45] (PS11) Klausman: modules/roles: Add k8s config for ML team machines [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [14:39:47] (PS1) Jbond: cfssl::client: add ability to proxy requests [puppet] - https://gerrit.wikimedia.org/r/669837 [14:40:29] (Merged) jenkins-bot: eventgate-analytics-external - Bump replicas to 6 for increase in mediawiki.client.session_tick [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:40:35] (PS12) Klausman: modules/roles: Add 
k8s config for ML team machines [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [14:41:35] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28433/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:42:40] (PS1) BBlack: ATS: force cache revalidation for 6 wikis [puppet] - https://gerrit.wikimedia.org/r/669840 (https://phabricator.wikimedia.org/T274784) [14:43:40] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28434/console" [puppet] - https://gerrit.wikimedia.org/r/669837 (owner: Jbond) [14:44:20] (CR) Ottomata: "I found a comment where I was able to push 1800 / second through a single instance. https://phabricator.wikimedia.org/T220661#5116643" [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:45:44] (PS13) Klausman: modules/roles: Add k8s config for ML team machines [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [14:46:28] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28435/console" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:47:27] (CR) Ottomata: "Added https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Benchmarking" [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:48:24] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[14:48:24] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:08] (PS2) Jbond: cfssl::client: add ability to proxy requests [puppet] - https://gerrit.wikimedia.org/r/669837 [14:49:40] (CR) Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/668750 (https://phabricator.wikimedia.org/T276502) (owner: Ottomata) [14:51:26] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:51:26] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:47] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28439/console" [puppet] - https://gerrit.wikimedia.org/r/669837 (owner: Jbond) [14:53:32] (CR) Elukey: "Pcc shows only weird admin user entries, but I believe those are users absented etc.." [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:53:47] (CR) Jbond: [V: +1 C: +2] cfssl::client: add ability to proxy requests [puppet] - https://gerrit.wikimedia.org/r/669837 (owner: Jbond) [14:54:08] !log drain + reimage an-worker110[7,8] to Buster [14:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:37] (PS1) Kormat: mariadb: Use section params: remaining profiles.
[puppet] - https://gerrit.wikimedia.org/r/669845 [14:55:58] (PS1) JMeybohm: eventrouter: Update build and base image, switch to nobody [docker-images/production-images] - https://gerrit.wikimedia.org/r/669846 (https://phabricator.wikimedia.org/T274852) [14:56:22] (PS2) Ottomata: Remove overrides for EL migration for Growth team schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/668764 (https://phabricator.wikimedia.org/T267333) [14:56:34] (PS2) Ottomata: Remove overrides for EL migration for WMDE Technical Wishes schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/668766 (https://phabricator.wikimedia.org/T275005) [14:56:54] (CR) Klausman: [V: +1] "> Patch Set 13:" [puppet] - https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: Klausman) [14:58:28] (CR) Ottomata: [C: +2] Remove overrides for EL migration for Growth team schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/668764 (https://phabricator.wikimedia.org/T267333) (owner: Ottomata) [14:58:36] (CR) Ottomata: [C: +2] Remove overrides for EL migration for WMDE Technical Wishes schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/668766 (https://phabricator.wikimedia.org/T275005) (owner: Ottomata) [14:59:56] tgr_: o/ ok if i sync your InitialiseSettings-labs.php change?
[15:00:00] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/669816 [15:00:07] SRE, SRE-tools: Per host access control for kerberized SSH - https://phabricator.wikimedia.org/T276790 (MoritzMuehlenhoff) [15:00:16] i was about to sync something and just saw it in the diff [15:00:22] (PS2) Mforns: WikimediaEvents: Bump session_tick sampling rate to 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) (owner: Mholloway) [15:00:29] ottomata: uh, sorry, got distracted in the middle of that [15:00:36] it doesn't need to be synced [15:00:53] oh right, i guess not synced [15:00:55] but rebased? [15:01:01] yes please, thanks [15:01:03] so deployers don't have to think about it :) [15:01:04] ok cool thank you [15:02:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14664 and previous config saved to /var/cache/conftool/dbconfig/20210308-150159-root.json [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:50] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Remove wgEventLoggingSchemas overrides for Growth and WMDE Tech wishes schemas - T267333, etc.
(duration: 00m 59s) [15:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:57] T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 [15:03:08] (PS2) Ottomata: Migrate PrefUpdate to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/668769 (https://phabricator.wikimedia.org/T267348) [15:05:04] (CR) Ottomata: [C: +2] Migrate PrefUpdate to EventGate on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/668769 (https://phabricator.wikimedia.org/T267348) (owner: Ottomata) [15:06:31] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (OlyKalinichenkoSpeedAndFunction) @JMeybohm, I've created and added a new dedicated key (4096 bits). Could you please add this key ` ssh-rsa AAAAB... [15:07:18] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate PrefUpdate to EventGate on all wikis - T267348 (duration: 00m 59s) [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:25] T267348: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 [15:10:40] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (JMeybohm) [15:11:48] (PS2) Kormat: mariadb: Use section params: remaining profiles.
[puppet] - https://gerrit.wikimedia.org/r/669845 [15:12:40] (PS1) JMeybohm: Add new SSH key for olykalinichenko [puppet] - https://gerrit.wikimedia.org/r/669866 (https://phabricator.wikimedia.org/T275677) [15:13:46] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (OlyKalinichenkoSpeedAndFunction) @KFrancis could you please check a screenshot with *L3* {F34145575} [15:14:01] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1107.eqiad.wmnet with reason: REIMAGE [15:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:42] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28441/console" [puppet] - https://gerrit.wikimedia.org/r/669845 (owner: Kormat) [15:16:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1107.eqiad.wmnet with reason: REIMAGE [15:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1108.eqiad.wmnet with reason: REIMAGE [15:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14665 and previous config saved to /var/cache/conftool/dbconfig/20210308-151703-root.json [15:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1108.eqiad.wmnet with reason: REIMAGE [15:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:13] (PS1) Ottomata: 
Finalize PrefUpdate schema ingestion migration to Event Platform [puppet] - https://gerrit.wikimedia.org/r/669867 (https://phabricator.wikimedia.org/T267348) [15:21:29] (CR) Ottomata: [C: +2] Finalize PrefUpdate schema ingestion migration to Event Platform [puppet] - https://gerrit.wikimedia.org/r/669867 (https://phabricator.wikimedia.org/T267348) (owner: Ottomata) [15:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14666 and previous config saved to /var/cache/conftool/dbconfig/20210308-153207-root.json [15:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:55] PROBLEM - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is CRITICAL: SSL CRITICAL - Certificate restbase2020-a valid until 2021-04-07 15:35:52 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [15:36:21] PROBLEM - cassandra-b SSL 10.192.32.120:7001 on restbase2020 is CRITICAL: SSL CRITICAL - Certificate restbase2020-b valid until 2021-04-07 15:35:54 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [15:36:41] PROBLEM - cassandra-c SSL 10.192.32.121:7001 on restbase2020 is CRITICAL: SSL CRITICAL - Certificate restbase2020-c valid until 2021-04-07 15:35:55 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [15:36:55] PROBLEM - cassandra-a SSL 10.192.16.98:7001 on restbase2019 is CRITICAL: SSL CRITICAL - Certificate restbase2019-a valid until 2021-04-07 15:35:49 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [15:36:59] PROBLEM - cassandra-c SSL 10.192.16.100:7001 on restbase2019 is CRITICAL: SSL CRITICAL - Certificate restbase2019-c valid until 2021-04-07 15:35:51 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [15:37:35] (PS1) Jbond: P:pki::client: allow proxy service to use TLS when proxying [puppet] - https://gerrit.wikimedia.org/r/669873 [15:37:37] (PS1) 
10Jbond: cfssl::cert: add ability to notify a service on renew [puppet] - 10https://gerrit.wikimedia.org/r/669874 [15:42:05] (03PS3) 10Mforns: WikimediaEvents: Bump session_tick sampling rate to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) (owner: 10Mholloway) [15:45:29] PROBLEM - cassandra-b SSL 10.192.16.99:7001 on restbase2019 is CRITICAL: SSL CRITICAL - Certificate restbase2019-b valid until 2021-04-07 15:35:50 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [15:46:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) >>! In T273915#6889732, @Jclark-ctr wrote: > @RobH if you can clarify racking instructions. I have reviewed racking instructions and what these are replacin... [15:47:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14669 and previous config saved to /var/cache/conftool/dbconfig/20210308-154710-root.json [15:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:35] (03PS1) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services mw canaries [puppet] - 10https://gerrit.wikimedia.org/r/669878 (https://phabricator.wikimedia.org/T255568) [15:48:30] (03PS2) 10Jbond: cfssl::cert: add ability to notify a service on renew [puppet] - 10https://gerrit.wikimedia.org/r/669874 [15:52:01] (03CR) 10Effie Mouzeli: "This has been tested already on https://gerrit.wikimedia.org/r/c/operations/puppet/+/663796" [puppet] - 10https://gerrit.wikimedia.org/r/669878 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [15:54:57] (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: increase timeout for requests to service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669879 (https://phabricator.wikimedia.org/T274198) [15:55:04] !log Restar db11115 
[15:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:41] !log Restart db1115 (tendril host) [15:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:57] PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [15:58:02] ^ expected [16:01:37] (03PS3) 10Kormat: mariadb: Use section params: remaining profiles. [puppet] - 10https://gerrit.wikimedia.org/r/669845 [16:01:42] (03PS8) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [16:02:13] RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 106041 bytes in 7.733 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [16:02:29] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28442/console" [puppet] - 10https://gerrit.wikimedia.org/r/669845 (owner: 10Kormat) [16:03:47] I'll deploy a beta-only patch [16:04:04] (and hopefully not forget what I was doing halfway in, this time!) [16:05:15] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: increase timeout for requests to service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669879 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [16:05:17] (03PS4) 10Kormat: mariadb: Use section params: remaining profiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/669845 [16:06:07] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: increase timeout for requests to service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669879 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [16:06:19] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28443/console" [puppet] - 10https://gerrit.wikimedia.org/r/669845 (owner: 10Kormat) [16:08:46] (03PS5) 10Kormat: mariadb: Use section params: remaining profiles. [puppet] - 10https://gerrit.wikimedia.org/r/669845 [16:09:32] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28444/console" [puppet] - 10https://gerrit.wikimedia.org/r/669845 (owner: 10Kormat) [16:11:21] (03PS3) 10Jbond: cfssl::cert: add ability to notify a service on renew [puppet] - 10https://gerrit.wikimedia.org/r/669874 [16:12:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28446/console" [puppet] - 10https://gerrit.wikimedia.org/r/669874 (owner: 10Jbond) [16:14:08] (03PS1) 10Muehlenhoff: Cumin aliases for unpriv Cumin [puppet] - 10https://gerrit.wikimedia.org/r/669881 [16:14:23] (03PS6) 10Kormat: mariadb: Use section params: remaining profiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/669845 [16:14:50] (03CR) 10Jbond: [C: 03+2] P:pki::client: allow proxy service to use TLS when proxying [puppet] - 10https://gerrit.wikimedia.org/r/669873 (owner: 10Jbond) [16:14:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl::cert: add ability to notify a service on renew [puppet] - 10https://gerrit.wikimedia.org/r/669874 (owner: 10Jbond) [16:17:38] !log drain + reimage an-worker1109/1110 to Buster [16:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:18] (03CR) 10Muehlenhoff: [C: 03+2] Cumin aliases for unpriv Cumin [puppet] - 10https://gerrit.wikimedia.org/r/669881 (owner: 10Muehlenhoff) [16:18:57] (03PS1) 10Jbond: cloud - hiera: add defaults [puppet] - 10https://gerrit.wikimedia.org/r/669884 [16:19:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:21] 10SRE, 10observability: Convert udp2log init script to use systemd - https://phabricator.wikimedia.org/T276623 (10lmata) @herron this might be worth looking into as part of the mwlog buster upgrade [16:21:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:22] (03CR) 10Jbond: [C: 03+2] cloud - hiera: add defaults [puppet] - 10https://gerrit.wikimedia.org/r/669884 (owner: 10Jbond) [16:23:42] (03PS6) 10Ahmon Dancy: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 [16:25:13] 10SRE, 10SRE-tools, 10Patch-For-Review: Evaluate options for non-root operations with cumin and spicerack cookbooks - https://phabricator.wikimedia.org/T244840 (10MoritzMuehlenhoff) Cumin has been adapted to be usable for non-privileged users with Kerberos (sans a final patch for the logging config to land i... [16:26:13] (03PS1) 10Jbond: cfssl::client: store bundle certs in own directory as we purge others [puppet] - 10https://gerrit.wikimedia.org/r/669887 [16:30:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10Legoktm) >>! In T275677#6891497, @RhinosF1 wrote: > All 3 contractors have done so despite being told. The other one just go... [16:30:45] (03CR) 10Bstorm: [C: 03+2] "Since this is an improvement, confirmed in testing, and is also the defined view on the live dbs, I'm going to merge this." [puppet] - 10https://gerrit.wikimedia.org/r/668843 (https://phabricator.wikimedia.org/T276628) (owner: 10Bstorm) [16:30:54] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) The motherboard was swapped on friday but did not fix the issue. The Dell tech did more troubleshooting and it was determined the backplane is bad. Waiting on the part and tech to schedule a time with me to r... [16:32:20] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Marostegui) Thanks! 
[16:33:44] (03CR) 10Bstorm: [C: 03+1] wikireplica: depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/668803 (owner: 10Phamhi) [16:33:56] (03PS1) 10Dduvall: maintenance: Skip setAgentAndTriggers for DB_NONE maintenance tasks [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/669790 (https://phabricator.wikimedia.org/T260827) [16:36:23] (03CR) 10Bstorm: "This would have depooled clouddb1017 and routed all traffic to clouddb1013. Luckily, I don't think anyone was using it. It's just the reve" [puppet] - 10https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [16:37:07] 10SRE, 10Analytics, 10CirrusSearch, 10Wikidata, and 3 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10MPhamWMF) [16:39:10] (03CR) 10Jbond: [C: 03+2] cfssl::client: store bundle certs in own directory as we purge others [puppet] - 10https://gerrit.wikimedia.org/r/669887 (owner: 10Jbond) [16:39:12] (03CR) 10Marostegui: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [16:40:01] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/669628 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [16:40:15] (03PS9) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [16:41:13] (03PS1) 10Bstorm: wikireplicas: expose actor_user = NULL (IPs) again in actor view [puppet] - 10https://gerrit.wikimedia.org/r/669888 (https://phabricator.wikimedia.org/T276698) [16:41:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1109.eqiad.wmnet with reason: REIMAGE [16:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:02] (03CR) 10Bstorm: "I've tested this on clouddb1013, and it definitely 
maintains the improved filtering using examples from the tasks, but it also is confirme" [puppet] - 10https://gerrit.wikimedia.org/r/669888 (https://phabricator.wikimedia.org/T276698) (owner: 10Bstorm) [16:43:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1110.eqiad.wmnet with reason: REIMAGE [16:43:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1109.eqiad.wmnet with reason: REIMAGE [16:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:19] (03PS1) 10Jbond: cloud - hiera: enable proxy client on proxy host [puppet] - 10https://gerrit.wikimedia.org/r/669889 [16:44:53] (03CR) 10Jbond: [C: 03+2] cloud - hiera: enable proxy client on proxy host [puppet] - 10https://gerrit.wikimedia.org/r/669889 (owner: 10Jbond) [16:44:55] (03CR) 10CRusnov: "> Patch Set 5:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [16:45:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1110.eqiad.wmnet with reason: REIMAGE [16:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:39] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:53:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:53:34] 10SRE, 10Analytics, 10CirrusSearch, 10Wikidata, and 3 others: Upgrade 
prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10colewhite) [16:53:38] 10Puppet, 10Analytics-Radar, 10Cassandra, 10observability, and 2 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10colewhite) [16:53:59] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta feature for newtopictool on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669893 (https://phabricator.wikimedia.org/T275827) [16:54:04] 10SRE, 10Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10MPhamWMF) [16:54:39] (03CR) 10Volans: "> Yes this will break af-netbox if it still exists (I was under the impression this was no longer in use and/or has been removed already)." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [16:55:05] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 195 bytes in 1.430 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:56:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:57:00] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10wiki_willy) a:03Cmjohnson [16:58:40] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10wiki_willy) @cmjohnson @elukey - just a heads up, this may put a wrench in moving one of the ms-be servers to A7. 
Let me see when the next time something in this rack is scheduled to be deco... [16:58:43] (03PS1) 10Cwhite: Upgrade to upstream version 0.15.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/669894 [16:59:10] (03PS2) 10Cwhite: Upgrade to upstream version 0.15.0 [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/669894 (https://phabricator.wikimedia.org/T276595) [17:00:04] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta features on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669895 (https://phabricator.wikimedia.org/T276189) [17:02:30] (03PS10) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [17:06:46] (03PS1) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [17:08:04] (03CR) 10jerkins-bot: [V: 04-1] Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 (owner: 10Andrew Bogott) [17:09:09] (03PS1) 10Ottomata: Declare streams an migrate Editing schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669897 (https://phabricator.wikimedia.org/T267343) [17:10:40] (03CR) 10jerkins-bot: [V: 04-1] Declare streams an migrate Editing schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669897 (https://phabricator.wikimedia.org/T267343) (owner: 10Ottomata) [17:10:46] (03PS2) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [17:11:12] (03PS2) 10Ottomata: Declare streams an migrate Editing schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669897 (https://phabricator.wikimedia.org/T267343) [17:12:50] !log drain + reimage an-worker11[13,14] to Buster 
[17:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:50] 10Puppet, 10SRE: puppet admin module: Assigne approveres to unix groups - https://phabricator.wikimedia.org/T276465 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff Thanks for generating an overview, I'm taking care of these piece by piece (but at slow rate) [17:24:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:25:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:31:01] (03PS1) 10Ahmon Dancy: Don't read ExtensionMessages-.php if running mergeMessageFileList.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669908 [17:32:57] (03CR) 10Dduvall: [C: 03+1] Don't read ExtensionMessages-.php if running mergeMessageFileList.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669908 (owner: 10Ahmon Dancy) [17:36:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1113.eqiad.wmnet with reason: REIMAGE [17:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1114.eqiad.wmnet with reason: REIMAGE [17:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1113.eqiad.wmnet with reason: REIMAGE [17:38:20] 10SRE, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (10JMeybohm) This has been approved in 
today's SRE meeting. [17:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:02] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/669866 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [17:39:45] (03CR) 10JMeybohm: [C: 03+2] Add new SSH key for olykalinichenko [puppet] - 10https://gerrit.wikimedia.org/r/669866 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [17:40:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1114.eqiad.wmnet with reason: REIMAGE [17:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:13] 10SRE, 10SRE-Access-Requests, 10Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (10Ladsgroup) >>! In T276712#6893426, @JMeybohm wrote: > This has been approved in today's SRE meeting. {meme, src=carltondance} [17:41:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10JMeybohm) 05Open→03Resolved Account has been updated [17:46:03] (03PS2) 10Zabe: Enable flood flag on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) [17:49:51] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1005.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:51:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:55:05] 10Puppet, 10SRE: puppet admin module: Assigne approveres to unix groups - 
https://phabricator.wikimedia.org/T276465 (10jbond) fyi i used the following script to generate this https://phabricator.wikimedia.org/P14673 [17:59:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:59:25] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:04] ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210308T1800). [18:00:26] 10SRE, 10SRE-tools: Per host access control for kerberized SSH - https://phabricator.wikimedia.org/T276790 (10crusnov) p:05Triage→03Medium [18:06:29] (03CR) 10SBassett: [C: 03+1] "Tested manually on mwmaint1002. Appears to solve the actor_user = NULL issue as I was able to see that data within a few different test q" [puppet] - 10https://gerrit.wikimedia.org/r/669888 (https://phabricator.wikimedia.org/T276698) (owner: 10Bstorm) [18:10:35] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:11:06] !log drain + reimage an-worker11[15,16] to Buster [18:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:12] (03PS4) 10Andrew Bogott: labs_lvm: check for available space before partitioning [puppet] - 10https://gerrit.wikimedia.org/r/668567 (https://phabricator.wikimedia.org/T272114) [18:16:37] (03CR) 10Andrew Bogott: [C: 03+2] labs_lvm: check for available space before partitioning [puppet] - 10https://gerrit.wikimedia.org/r/668567 
(https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [18:24:30] 10Puppet, 10SRE: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (10Legoktm) [18:29:39] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 2.413 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:29:55] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:23] 10SRE, 10serviceops, 10Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (10Dzahn) 05Open→03Resolved [18:30:25] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Dzahn) [18:31:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1115.eqiad.wmnet with reason: REIMAGE [18:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:06] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1115.eqiad.wmnet with reason: REIMAGE [18:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:37] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10Dzahn) 05Open→03Stalled Setting this to stalled because 1 VM has been created and whether the second one is still needed is TBD for now. 
[18:34:59] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1116.eqiad.wmnet with reason: REIMAGE [18:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:37] 10SRE, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Open→03Stalled Setting this to Stalled again because we are blocked on board feedback currently. [18:36:54] (03PS1) 10Urbanecm: hiwiki: Add missing help panel link descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669942 (https://phabricator.wikimedia.org/T276450) [18:36:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1116.eqiad.wmnet with reason: REIMAGE [18:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:24] jouncebot: now [18:37:24] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [18:37:33] (03CR) 10Urbanecm: [C: 03+2] hiwiki: Add missing help panel link descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669942 (https://phabricator.wikimedia.org/T276450) (owner: 10Urbanecm) [18:37:46] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:12] (03PS2) 10Urbanecm: Set wgGEHelpPanelAskMentor to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667529 (https://phabricator.wikimedia.org/T275908) [18:38:31] (03Merged) 10jenkins-bot: hiwiki: Add missing help panel link descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669942 (https://phabricator.wikimedia.org/T276450) (owner: 10Urbanecm) [18:40:50] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: dfd95883ed15c532e6345d1dfacfc274b87fcd80: hiwiki: Add missing help panel link descriptions (T276450) (duration: 00m 58s) [18:40:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:56] T276450: Deploy Growth features on Hindi Wikipedia - https://phabricator.wikimedia.org/T276450 [18:41:43] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:42:57] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:43:05] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 9.776 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:44:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:47:46] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:53:53] (03PS1) 10Urbanecm: Fix sqwiki help panel links description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669952 (https://phabricator.wikimedia.org/T275550) [18:54:01] (03PS2) 10Urbanecm: Fix sqwiki help panel links description [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/669952 (https://phabricator.wikimedia.org/T275550) [18:54:05] (03CR) 10Urbanecm: [C: 03+2] Fix sqwiki help panel links description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669952 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [18:54:57] 10SRE, 10serviceops, 10Patch-For-Review: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (10Dzahn) [18:55:22] (03Merged) 10jenkins-bot: Fix sqwiki help panel links description [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669952 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [18:55:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:56:24] 10SRE, 10serviceops, 10Patch-For-Review: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (10Dzahn) p:05Triage→03Medium [18:56:58] (03PS3) 10Razzi: wikireplicas: give analytics_multiinstance role to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211) [18:58:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a85580030027ca5b879688ed5d76123454164001: Fix sqwiki help panel links description (T275550) (duration: 00m 58s) [18:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:39] T275550: Deploy Growth features on Albanian Wikipedia - https://phabricator.wikimedia.org/T275550 [19:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210308T1900) [19:00:05] Zabe, dancy, marxarelli, and phuedx: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:14] I can deploy today [19:00:27] o/ [19:00:35] Alive and kicking [19:00:45] dancy: I assume you'll want to self-deploy once I'm done with the rest? [19:00:50] (03PS3) 10Urbanecm: Enable flood flag on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) (owner: 10Zabe) [19:00:53] o/ [19:00:59] yes please [19:01:04] o/ [19:01:05] (03CR) 10Urbanecm: [C: 03+2] Enable flood flag on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) (owner: 10Zabe) [19:01:18] (03PS1) 10Legoktm: Add DATACENTER_NUMBERING_PREFIX constant [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 [19:01:20] dancy: okay, I will ping you once done. [19:01:49] (03Merged) 10jenkins-bot: Enable flood flag on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668708 (https://phabricator.wikimedia.org/T276560) (owner: 10Zabe) [19:02:25] Zabe: can you check your patch on mwdebug1001, please? 
[19:03:14] (03CR) 10Urbanecm: [C: 03+2] maintenance: Skip setAgentAndTriggers for DB_NONE maintenance tasks [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/669790 (https://phabricator.wikimedia.org/T260827) (owner: 10Dduvall) [19:03:17] (03CR) 10Urbanecm: [C: 03+2] maintenance: mergeMessageFileList should be DB_NONE [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668514 (https://phabricator.wikimedia.org/T260827) (owner: 10Dduvall) [19:03:20] (03CR) 10Urbanecm: [C: 03+2] maintenance: Avoid missing l10n cache error in mergeMessageFileList [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668515 (owner: 10Dduvall) [19:03:23] (03CR) 10Urbanecm: [C: 03+2] maintenance: rebuildLocalisationCache should be DB_NONE if possible [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668516 (https://phabricator.wikimedia.org/T260827) (owner: 10Dduvall) [19:03:44] marxarelli: +2'ed your backports, can either ping you once done, or deploy for you, up to you :) [19:04:05] (03PS3) 10Urbanecm: vector: Expand Desktop Improvements pilot wiki group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669783 (https://phabricator.wikimedia.org/T273090) (owner: 10Phuedx) [19:04:09] Urbanecm: a ping works for me. thanks! [19:04:09] (03CR) 10Urbanecm: [C: 03+2] vector: Expand Desktop Improvements pilot wiki group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669783 (https://phabricator.wikimedia.org/T273090) (owner: 10Phuedx) [19:04:13] (03CR) 10jerkins-bot: [V: 04-1] Add DATACENTER_NUMBERING_PREFIX constant [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 (owner: 10Legoktm) [19:04:15] marxarelli: okay, cool :) [19:04:27] Urbanecm: I'm sorry, but I don't know what you mean [19:05:11] Urbanecm: I'm with Olga Vasileva for testing the change [19:05:13] Zabe: that's fine. 
Can you install the gadget from https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage, pick "mwdebug1001.eqiad.wmnet" from the list of debug servers in the gadget, and then go to hrwiki's Special:UserGroupRights to make sure the patch works as expected? [19:05:23] phuedx: ack, I'll ping you once it's ready. [19:05:52] (03Merged) 10jenkins-bot: vector: Expand Desktop Improvements pilot wiki group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669783 (https://phabricator.wikimedia.org/T273090) (owner: 10Phuedx) [19:06:08] Zabe: do let me know if you need any assistance. [19:06:37] phuedx: mwdebug1002 is ready for your/Olga tests. [19:09:17] Urbanecm: Thanks. Yes I see the pseudobots group with the expected settings. [19:09:24] cool, I'll sync it [19:09:30] Urbanecm: Thanks. Testing now [19:10:04] (03PS2) 10BBlack: ATS: force cache revalidation for 7 wikis [puppet] - 10https://gerrit.wikimedia.org/r/669840 (https://phabricator.wikimedia.org/T274784) [19:10:37] (03CR) 10Elukey: [C: 03+1] "Looks good, let's also run Pcc to make sure that all is good :)" [puppet] - 10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [19:10:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e1cb98890fd4ad0ed25670de2fff6db6e59d7132: Enable flood flag on hrwiki (T276560) (duration: 00m 58s) [19:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:02] T276560: Enable flood flag on hrwiki - https://phabricator.wikimedia.org/T276560 [19:11:03] Zabe: synced. [19:11:12] Urbanecm: Thanks [19:11:17] np [19:12:02] Urbanecm: There's an error in my patch. I've enabled the change on Bulgarian and not Bengali. Should I revert the original and submit a new good one? [19:12:12] phuedx: please submit a follow-up [19:12:20] ie. 
a patch that changes only what's wrong [19:12:29] no need to revert if we're going to apply most of it :) [19:12:56] (03CR) 10Elukey: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28451/console" [puppet] - 10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [19:13:03] (03CR) 10Legoktm: "Should I suppress the unused variable prospector warning?" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 (owner: 10Legoktm) [19:13:25] (03CR) 10Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [19:13:50] (03PS2) 10Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) [19:14:33] (03CR) 10Razzi: [C: 03+2] wikireplicas: give analytics_multiinstance role to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [19:14:53] !log cp-text: disabling puppet ahead of T274784 changes - https://gerrit.wikimedia.org/r/c/operations/puppet/+/669840 [19:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:00] T274784: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 [19:15:44] (03PS1) 10Phuedx: vector: Fix Desktop Improvements pilot wiki group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669956 (https://phabricator.wikimedia.org/T273090) [19:15:48] Urbanecm: ^ [19:15:51] thanks [19:15:53] *facepalm* [19:16:12] (03CR) 10Urbanecm: [C: 03+2] vector: Fix Desktop Improvements pilot wiki group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669956 (https://phabricator.wikimedia.org/T273090) (owner: 10Phuedx) 
[19:16:28] phuedx: it's exactly what the mwdebug testing is for :) [19:16:37] (03CR) 10Esanders: [C: 03+1] Enable DiscussionTools' beta feature for newtopictool on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669893 (https://phabricator.wikimedia.org/T275827) (owner: 10Bartosz Dziewoński) [19:17:07] (03Merged) 10jenkins-bot: vector: Fix Desktop Improvements pilot wiki group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669956 (https://phabricator.wikimedia.org/T273090) (owner: 10Phuedx) [19:17:12] (03CR) 10Esanders: [C: 03+1] Enable DiscussionTools' beta features on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669895 (https://phabricator.wikimedia.org/T276189) (owner: 10Bartosz Dziewoński) [19:17:43] phuedx: pulled the new version onto mwdebug1002 [19:17:44] Sorry Bulgarians. [19:17:54] lol [19:18:25] Thanks Urbanecm. Those changes look good [19:18:37] (03PS1) 10Urbanecm: nowiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669957 (https://phabricator.wikimedia.org/T276816) [19:18:42] excellent, let's them push to the known and unknown universe :) [19:18:47] *push them [19:18:48] (03PS3) 10Ottomata: Declare streams an migrate Editing schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669897 (https://phabricator.wikimedia.org/T267343) [19:19:38] wow lots of backports today! I was going to deploy ^ but I will wait until the backport window is finished :) [19:19:50] let me know if that happens early please! 
[19:20:04] ottomata: will ping you [19:20:08] ty [19:20:43] !log urbanecm@deploy1002 Synchronized dblists/desktop-improvements.dblist: 1c46d0b: 1aad60b: vector: Expand Desktop Improvements pilot wiki group (T273090) (duration: 00m 57s) [19:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:49] T273090: Deploy modern Vector and new Vue.js search experience to new pilot wikis - https://phabricator.wikimedia.org/T273090 [19:21:53] !log urbanecm@deploy1002 Synchronized wmf-config/config/: 1c46d0b: 1aad60b: vector: Expand Desktop Improvements pilot wiki group (T273090) (duration: 00m 58s) [19:21:56] phuedx: should be live [19:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:16] dancy: your turn :) [19:22:24] (03CR) 10Jbond: [C: 04-1] "mostly fine, see inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [19:22:25] thx [19:23:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:48] Urbanecm: dancy and i will be deploying/testing together since our changes are related [19:24:10] marxarelli: no problem with me, but the backports are not yet merged. 
[19:24:22] got it [19:24:31] (03PS11) 10Andrew Bogott: cloud-vps: Add a new class to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [19:24:33] (03PS3) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [19:24:35] (03PS1) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [19:24:44] marxarelli: dancy: do the config-only changes you scheduled depend on the backports? [19:25:19] they should be independent [19:25:33] (03CR) 10Ahmon Dancy: [C: 03+2] env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy) [19:25:39] (03CR) 10Ahmon Dancy: [C: 03+2] wmf-config/CommonSettings.php: Add WMF_MAINTENANCE_OFFLINE handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [19:25:47] ack [19:25:50] (03CR) 10Ahmon Dancy: [C: 03+2] Don't read ExtensionMessages-.php if running mergeMessageFileList.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669908 (owner: 10Ahmon Dancy) [19:26:17] (03CR) 10jerkins-bot: [V: 04-1] Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:26:32] (03Merged) 10jenkins-bot: env.php: Allow the datacenter to be specified in WMF_DATACENTER environment variable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy) [19:26:58] (03Merged) 10jenkins-bot: wmf-config/CommonSettings.php: Add WMF_MAINTENANCE_OFFLINE handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [19:27:42] (03CR) 10BBlack: [C: 03+2] ATS: force cache revalidation for 7 wikis [puppet] - 10https://gerrit.wikimedia.org/r/669840 (https://phabricator.wikimedia.org/T274784) (owner: 10BBlack) [19:32:12] hmm, gate-and-submit succeeded for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/669908/ but it's no longer in zuul? [19:32:20] and unsubmitted [19:32:33] marxarelli: rebase it [19:32:38] going to rebase [19:32:39] yeah [19:32:47] jenkins repeatedly fails to rebase in mediawiki-config [19:32:49] I don't know why [19:33:20] (03PS2) 10Ahmon Dancy: Don't read ExtensionMessages-.php if running mergeMessageFileList.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669908 [19:33:22] (03Merged) 10jenkins-bot: maintenance: Skip setAgentAndTriggers for DB_NONE maintenance tasks [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/669790 (https://phabricator.wikimedia.org/T260827) (owner: 10Dduvall) [19:33:26] must be gerrit's fault since everything from jenkins to zull worked [19:33:27] (03Merged) 10jenkins-bot: maintenance: mergeMessageFileList should be DB_NONE [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668514 (https://phabricator.wikimedia.org/T260827) (owner: 10Dduvall) [19:33:32] *zuul* [19:33:37] dancy is on it [19:33:40] (03Merged) 10jenkins-bot: maintenance: Avoid missing l10n cache error in mergeMessageFileList [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668515 (owner: 10Dduvall) [19:33:45] (03Merged) 10jenkins-bot: maintenance: rebuildLocalisationCache should be DB_NONE if possible [core] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668516
(https://phabricator.wikimedia.org/T260827) (owner: 10Dduvall) [19:33:49] (03PS1) 10Zabe: Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) [19:33:52] (03CR) 10Ahmon Dancy: [V: 03+2 C: 03+2] Don't read ExtensionMessages-.php if running mergeMessageFileList.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669908 (owner: 10Ahmon Dancy) [19:34:56] (03Merged) 10jenkins-bot: Don't read ExtensionMessages-.php if running mergeMessageFileList.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669908 (owner: 10Ahmon Dancy) [19:37:09] !log cp-text: banning varnish-fe for req.http.host == ( 7 wikis from T274784 ) [19:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:17] T274784: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 [19:37:42] !log dduvall@deploy1002 Synchronized wmf-config/: wmf-config/env.php,CommonSettings.php: f70049b: e53dc3a: f9b9ea1: WMF_DATACENTER, WMF_MAINTENANCE_OFFLINE handling (duration: 01m 00s) [19:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:59] (03PS1) 10Mforns: Refine 2 ReadersWeb schemas using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/669962 (https://phabricator.wikimedia.org/T271164) [19:39:10] (03CR) 10jerkins-bot: [V: 04-1] Refine 2 ReadersWeb schemas using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/669962 (https://phabricator.wikimedia.org/T271164) (owner: 10Mforns) [19:39:45] (03PS1) 10Legoktm: docker_registry_ha: Remove legacy "uploader" user [puppet] - 10https://gerrit.wikimedia.org/r/669964 [19:39:56] running mergeMessageFileList to test the config real quick [19:40:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:40:23] `diff -u /tmp/dduvall-mmfl-output-test.php wmf-config/ExtensionMessages-1.36.0-wmf.33.php` looks good [19:41:07] (03PS2) 10Mforns: Refine 2 ReadersWeb schemas using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/669962 (https://phabricator.wikimedia.org/T271164) [19:41:33] same with `WMF_MAINTENANCE_OFFLINE` set [19:41:39] ok. moving on to wmf.33 backports [19:41:49] cool [19:43:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:43:38] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28453/console" [puppet] - 10https://gerrit.wikimedia.org/r/669964 (owner: 10Legoktm) [19:44:17] (03PS1) 10Zabe: Enable visualeditor on enwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669966 (https://phabricator.wikimedia.org/T276851) [19:45:47] Urbanecm: Hey, do you still have time for deployments? 
[19:46:04] Zabe: I can add you to the end of the queue [19:46:29] I'm not sure whether we can make it [19:47:04] not that important, i can request it tomorrow [19:47:08] syncing now and then i have two maintenance scripts to verify [19:47:10] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/669967 [19:47:19] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker_registry_ha: Remove legacy "uploader" user [puppet] - 10https://gerrit.wikimedia.org/r/669964 (owner: 10Legoktm) [19:47:20] should be relatively quick [19:47:22] !log dduvall@deploy1002 Synchronized php-1.36.0-wmf.33/maintenance/: maintenance: aa6f291: 4893ddb: fa97162: 380c448: DB_NONE offline maintenance improvements (duration: 00m 58s) [19:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:24] (03PS5) 10Legoktm: Validate htpasswd() salt is only 8 characters [puppet] - 10https://gerrit.wikimedia.org/r/666787 [19:49:56] 10SRE, 10Desktop Improvements, 10Traffic, 10Bengali-Sites, and 5 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10BBlack) ^ There was a last-minute change of plans, so we made a last-minute call to expend a little bit of... [19:50:25] marxarelli: ack. Lmk when done :) [19:52:02] (03CR) 10Legoktm: [C: 03+2] Validate htpasswd() salt is only 8 characters [puppet] - 10https://gerrit.wikimedia.org/r/666787 (owner: 10Legoktm) [19:52:16] Urbanecm: we're done. thank you! [19:52:22] great! 
[19:52:26] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10crusnov) p:05Triage→03Medium [19:52:38] ottomata: please do your stuff now, and let me know when done :) [19:53:09] ok [19:53:21] (03CR) 10Ottomata: [C: 03+2] Declare streams an migrate Editing schemas to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669897 (https://phabricator.wikimedia.org/T267343) (owner: 10Ottomata) [19:53:47] (03PS2) 10Legoktm: Add apache2 mod_rewrite to beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/668995 (https://phabricator.wikimedia.org/T276654) (owner: 10Majavah) [19:54:08] 10SRE, 10observability: Convert udp2log init script to use systemd - https://phabricator.wikimedia.org/T276623 (10crusnov) p:05Triage→03Medium [19:54:12] (03PS1) 10Cwhite: pontoon: initial hiera config for pontoon env [puppet] - 10https://gerrit.wikimedia.org/r/669968 [19:55:05] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate Editing schemas to Event Platform on testwiki - T267343, T267353 (duration: 00m 57s) [19:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:14] T267353: VisualEditorFeatureUse Event Platform Migration - https://phabricator.wikimedia.org/T267353 [19:55:15] T267343: EditAttemptStep Event Platform Migration - https://phabricator.wikimedia.org/T267343 [19:55:49] (03CR) 10Legoktm: [C: 03+2] Add apache2 mod_rewrite to beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/668995 (https://phabricator.wikimedia.org/T276654) (owner: 10Majavah) [19:56:19] (03CR) 10Legoktm: [C: 03+2] wancache: change deployment-prep to new Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/669436 (https://phabricator.wikimedia.org/T276707) (owner: 10Majavah) [19:56:48] (03PS2) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 
(https://phabricator.wikimedia.org/T269511) [19:58:08] Urbanecm: done syncing, testing some things on testwiki [19:58:13] (03CR) 10jerkins-bot: [V: 04-1] Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:58:13] thanks [19:58:25] ottomata: can i deploy, or should i wait for you to fully finish? [19:58:39] you can deploy [19:58:52] thx [19:58:58] (03PS2) 10Urbanecm: nowiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669957 (https://phabricator.wikimedia.org/T276816) [19:59:09] (03CR) 10Urbanecm: [C: 03+2] nowiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669957 (https://phabricator.wikimedia.org/T276816) (owner: 10Urbanecm) [19:59:22] Zabe: what's your patch please? [19:59:31] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/669960 [19:59:40] (03CR) 10Legoktm: [C: 04-1] redis::multidc: Make discovery optional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/669447 (owner: 10Majavah) [19:59:50] and if possible https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/669966 [19:59:57] (03Merged) 10jenkins-bot: nowiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669957 (https://phabricator.wikimedia.org/T276816) (owner: 10Urbanecm) [20:00:01] Zabe: I'm not going to deploy the first patch without signoff from the editing team. 
[20:00:09] ok [20:00:19] ottomata: /srv/mediawiki-staging is out of sync [20:01:29] (03PS3) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [20:02:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:15] outstanding patch is `Declare streams an migrate Editing schemas to Event Platform on testwiki ` authored by ottomata :) [20:03:27] (03CR) 10Urbanecm: [C: 04-2] "requires signoff by editing team" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [20:05:27] ottomata: ping? [20:10:44] ottomata: ping? [20:11:39] (03PS1) 10RobH: an-druid100[345] updates [puppet] - 10https://gerrit.wikimedia.org/r/669969 (https://phabricator.wikimedia.org/T274163) [20:12:05] 10SRE, 10Analytics, 10CirrusSearch, 10Wikidata, and 4 others: Upgrade prometheus-jmx-exporter - https://phabricator.wikimedia.org/T276595 (10crusnov) p:05Triage→03Medium [20:13:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10RobH) [20:14:02] (03CR) 10RobH: [C: 03+2] an-druid100[345] updates [puppet] - 10https://gerrit.wikimedia.org/r/669969 (https://phabricator.wikimedia.org/T274163) (owner: 10RobH) [20:14:19] (03PS2) 10RobH: an-druid100[345] updates [puppet] - 10https://gerrit.wikimedia.org/r/669969 (https://phabricator.wikimedia.org/T274163) [20:17:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:18:07] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/669888
(https://phabricator.wikimedia.org/T276698) (owner: 10Bstorm) [20:20:29] (03PS13) 10Dduvall: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) [20:21:36] (03PS1) 10Urbanecm: Revert "Declare streams an migrate Editing schemas to Event Platform on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669793 (https://phabricator.wikimedia.org/T267343) [20:21:38] (03CR) 10Volans: [C: 03+1] "LGTM!" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [20:21:47] (03PS2) 10Urbanecm: Revert "Declare streams an migrate Editing schemas to Event Platform on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669793 (https://phabricator.wikimedia.org/T267343) [20:21:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:21] (03CR) 10Urbanecm: [C: 03+2] "reverting undeployed commit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669793 (https://phabricator.wikimedia.org/T267343) (owner: 10Urbanecm) [20:23:08] (03Merged) 10jenkins-bot: Revert "Declare streams an migrate Editing schemas to Event Platform on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669793 (https://phabricator.wikimedia.org/T267343) (owner: 10Urbanecm) [20:23:13] (03PS14) 10Dduvall: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) [20:23:39] !log miscweb[12]002 - disabling puppet to remake cergen cert... [20:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:27] Urbanecm: oh sorry, i am in a meeting [20:24:34] Urbanecm: i didn't sync it???
[20:24:39] ottomata: you forgot git rebase [20:24:41] I reverted it [20:24:42] oh maybe i forgot the rebase? [20:24:55] thank you i am very sorry, i was wondering why it wasn't quite working [20:24:56] (03CR) 10Volans: "> Should I suppress the unused variable prospector warning?" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 (owner: 10Legoktm) [20:24:58] but then a meeting started... [20:24:58] :/ [20:24:59] sorry [20:25:13] okay [20:25:16] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-druid1003.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... [20:25:17] you'll deploy it later i guess [20:26:03] (03CR) 10Dduvall: "> Patch Set 11: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [20:26:09] (03PS2) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [20:26:38] (03CR) 10Jeena Huneidi: Rsync private mediawiki files to releases server (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [20:26:54] (03CR) 10Dduvall: "Note I've added the l10n cache build to this as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [20:27:35] volans|away: we could test set(ALL_DATACENTERS) == set(DATACENTER_NUMBERING_PREFIX) ? 
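The check legoktm suggests above can be sketched roughly as follows. This is only an illustration: the values assigned to `ALL_DATACENTERS` and `DATACENTER_NUMBERING_PREFIX` are stand-ins assumed for the example (the real ones live in the pywmflib change under review), and the helper `numbering_matches_datacenters` is a hypothetical name, not part of that patch.

```python
# Stand-in values, assumed for illustration only; the real constants
# are defined in the pywmflib change discussed in the log.
ALL_DATACENTERS = ("eqiad", "codfw", "esams", "ulsfo", "eqsin")
DATACENTER_NUMBERING_PREFIX = {
    "eqiad": "1",
    "codfw": "2",
    "esams": "3",
    "ulsfo": "4",
    "eqsin": "5",
}


def numbering_matches_datacenters(datacenters, numbering_prefix):
    """Return True when both collections name exactly the same datacenters.

    Iterating a dict yields its keys, so set(numbering_prefix) is the set
    of datacenter names that have a numbering prefix.
    """
    return set(datacenters) == set(numbering_prefix)


def test_datacenter_numbering_prefix():
    # Fails if a datacenter is added without a prefix, or the reverse.
    assert numbering_matches_datacenters(ALL_DATACENTERS, DATACENTER_NUMBERING_PREFIX)
```

Comparing sets rather than lengths means the test would also catch a misspelled datacenter name on either side, and it gives the otherwise unused constant a consumer.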
[20:28:20] (03PS1) 10Urbanecm: Revert "nowiki: Enable Growth features in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669794 (https://phabricator.wikimedia.org/T276816) [20:28:26] (03PS2) 10Urbanecm: Revert "nowiki: Enable Growth features in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669794 (https://phabricator.wikimedia.org/T276816) [20:28:38] (03CR) 10Urbanecm: [C: 03+2] Revert "nowiki: Enable Growth features in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669794 (https://phabricator.wikimedia.org/T276816) (owner: 10Urbanecm) [20:29:32] (03Merged) 10jenkins-bot: Revert "nowiki: Enable Growth features in stealth mode" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669794 (https://phabricator.wikimedia.org/T276816) (owner: 10Urbanecm) [20:30:34] legoktm: that would be nice indeed [20:30:40] (03PS1) 10BBlack: New cert for webserver-misc-apps [puppet] - 10https://gerrit.wikimedia.org/r/669972 (https://phabricator.wikimedia.org/T266470) [20:31:25] (03CR) 10BBlack: [C: 03+2] New cert for webserver-misc-apps [puppet] - 10https://gerrit.wikimedia.org/r/669972 (https://phabricator.wikimedia.org/T266470) (owner: 10BBlack) [20:32:53] !log miscweb[12]002 - re-enabled puppet and deployed new cert [20:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:45] (03PS2) 10Dduvall: pipeline: add building the webserver image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669807 (owner: 10Giuseppe Lavagetto) [20:33:46] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:16] (03PS3) 10Urbanecm: Set wgGEHelpPanelAskMentor to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667529 (https://phabricator.wikimedia.org/T275908) [20:34:27] (03CR) 10Urbanecm: [C: 03+2] Set wgGEHelpPanelAskMentor to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667529 (https://phabricator.wikimedia.org/T275908) (owner: 10Urbanecm) [20:35:17] (03Merged) 10jenkins-bot: Set wgGEHelpPanelAskMentor to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667529 (https://phabricator.wikimedia.org/T275908) (owner: 10Urbanecm) [20:38:14] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5ce7b4602d2b109adfb86bef6795a4d07a1208b9: Set wgGEHelpPanelAskMentor to true by default (T275908) (duration: 01m 07s) [20:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:21] T275908: Scale: default to mentorship questions from help panel - https://phabricator.wikimedia.org/T275908 [20:39:16] (03PS2) 10Legoktm: Add DATACENTER_NUMBERING_PREFIX constant [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 [20:39:58] (03CR) 10Legoktm: "> 1) add a small test for it, a bit pointless I agree" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 (owner: 10Legoktm) [20:41:32] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: REIMAGE [20:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: REIMAGE [20:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:30] !log legoktm@registry1004:~$ sudo systemctl reset-failed # to fix icinga warning [20:44:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:40] RECOVERY - Check systemd state on registry1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:58] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1003.eqiad.wmnet'] ` and were **ALL** successful. [20:51:22] (03PS1) 10Urbanecm: idwiki: Growth features: Add mentorlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669975 (https://phabricator.wikimedia.org/T259024) [20:51:24] (03CR) 10Urbanecm: [C: 03+2] idwiki: Growth features: Add mentorlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669975 (https://phabricator.wikimedia.org/T259024) (owner: 10Urbanecm) [20:52:21] (03Merged) 10jenkins-bot: idwiki: Growth features: Add mentorlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669975 (https://phabricator.wikimedia.org/T259024) (owner: 10Urbanecm) [20:53:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1227e2ad8d14e5f0a10a1050e0fadbe0d3c3e238: idwiki: Growth features: Add mentorlist (T259024) (duration: 00m 58s) [20:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:51] T259024: Deploy Growth experiments at Indonesian Wikipedia - https://phabricator.wikimedia.org/T259024 [20:54:01] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/669953 (owner: 10Legoktm) [20:54:14] * Urbanecm done [20:55:05] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10wiki_willy) [20:56:10] Urbanecm ok meeting over, sorry about that [20:56:17] ok if I unrevert and actually sync? 
[20:56:26] ottomata: yes, I'm done for now [20:57:58] (03PS1) 10Ottomata: Revert "Revert "Declare streams an migrate Editing schemas to Event Platform on testwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669795 [20:58:13] (03PS2) 10Ottomata: Revert "Revert "Declare streams an migrate Editing schemas to Event Platform on testwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669795 [21:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210308T2100). [21:00:18] (03CR) 10Ottomata: [C: 03+2] Revert "Revert "Declare streams an migrate Editing schemas to Event Platform on testwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669795 (owner: 10Ottomata) [21:03:31] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10wiki_willy) Some of the mw servers in rack A7 should be decom'd, after T273915 is installed for the refresh. Since the power in A7 is maxing out, I think we sh... [21:04:17] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-druid1005.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... 
[21:04:47] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate Editing schemas to Event Platform on testwiki, take 2 - T267343, T267353 (duration: 00m 58s)
[21:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:55] T267353: VisualEditorFeatureUse Event Platform Migration - https://phabricator.wikimedia.org/T267353
[21:04:55] T267343: EditAttemptStep Event Platform Migration - https://phabricator.wikimedia.org/T267343
[21:08:22] volans|away: is it OK for me to +2 both? or
[21:09:40] legoktm: sure, just +2 it's enough, CI will auto-merge, for wmflib I'll make a new release soon, the cookbook will be synced by puppet, do you have a VM to create to test it?
[21:09:48] (PS1) Dzahn: testreduce: invalid test change to show scandium is not affected [puppet] - https://gerrit.wikimedia.org/r/669980
[21:10:52] (PS1) Ottomata: Migrate Editor schemas to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669981 (https://phabricator.wikimedia.org/T267343)
[21:11:06] volans|away: I don't currently have any VMs to create
[21:11:31] (CR) Legoktm: [C: +2] Add DATACENTER_NUMBERING_PREFIX constant [software/pywmflib] - https://gerrit.wikimedia.org/r/669953 (owner: Legoktm)
[21:11:50] (CR) jerkins-bot: [V: -1] Migrate Editor schemas to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669981 (https://phabricator.wikimedia.org/T267343) (owner: Ottomata)
[21:13:45] (PS2) Ottomata: Migrate Editor schemas to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669981 (https://phabricator.wikimedia.org/T267343)
[21:13:59] (Merged) jenkins-bot: Add DATACENTER_NUMBERING_PREFIX constant [software/pywmflib] - https://gerrit.wikimedia.org/r/669953 (owner: Legoktm)
[21:14:41] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) > The config files in /etc/testreduce/*parsoid-rt* and /etc/testreduce/*parsoid-vd* should still be puppe...
[21:15:23] (CR) Ottomata: [C: +2] Migrate Editor schemas to Event Platform on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/669981 (https://phabricator.wikimedia.org/T267343) (owner: Ottomata)
[21:16:16] (Abandoned) Dzahn: testreduce: invalid test change to show scandium is not affected [puppet] - https://gerrit.wikimedia.org/r/669980 (owner: Dzahn)
[21:16:21] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) Open→Resolved a:Dzahn→ssastry Let me know if you see anything else that is missing here.
[21:18:27] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate Editing schemas to Event Platform on all wikis - T267343, T267353 (duration: 00m 58s)
[21:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:35] T267353: VisualEditorFeatureUse Event Platform Migration - https://phabricator.wikimedia.org/T267353
[21:18:35] T267343: EditAttemptStep Event Platform Migration - https://phabricator.wikimedia.org/T267343
[21:20:35] (CR) Cwhite: [C: +2] logstash: ingest logstash logs as json and convert to ECS [puppet] - https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) (owner: Cwhite)
[21:21:38] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:23:40] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 19.17 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[21:23:43] (CR) Ottomata: [C: +2] Refine 2 ReadersWeb schemas using eventlogging_legacy job [puppet] - https://gerrit.wikimedia.org/r/669962 (https://phabricator.wikimedia.org/T271164) (owner: Mforns)
[21:28:18] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:56] (PS1) Dzahn: admin: create new admin group mailman3-roots [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712)
[21:32:32] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:36:06] (CR) Dzahn: "As requested and approved in today's SRE meeting." [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712) (owner: Dzahn)
[21:37:00] (PS2) Mholloway: WikimediaEvents: Create data QA group/right on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515)
[21:38:59] (CR) Mholloway: [C: +2] WikimediaEvents: Create data QA group/right on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515) (owner: Mholloway)
[21:39:57] (Merged) jenkins-bot: WikimediaEvents: Create data QA group/right on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/668716 (https://phabricator.wikimedia.org/T276515) (owner: Mholloway)
[21:42:48] !log mholloway-shell@deploy1002 Synchronized wmf-config/CommonSettings.php: WikimediaEvents: Create data QA group/right on testwiki (T276515) (duration: 00m 57s)
[21:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:57] T276515: Generate Session Length test data - https://phabricator.wikimedia.org/T276515
[21:44:42] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1005.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1005.eqiad.wmnet'] `
[21:47:06] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn) The change above is abandoned because it was about adding a director to the traffic/caching layer and now the VM moved from private to public network. So it is not behind caching anymor...
[21:52:31] SRE, GitLab (Initialization), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (Dzahn) >>! In T276148#6875418, @Legoktm wrote: > Also I believe `git-ssh.wikimedia.org` ran/runs a public sshd o...
[21:53:12] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.95 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[21:55:22] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (brennen) > Do you really want "gitlab" for the MVP though, not like gitlab-test.wikimedia.org or gitlab-beta.wikimedia.org ? Yeah, I think so. Users (and other consumers of data, like bots)...
[21:58:29] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Sergey.Trofimovsky.SF) >>! In T276170#6894252, @Dzahn wrote: > Do you really want "gitlab" for the MVP though, not like gitlab-test.wikimedia.org or gitlab-beta.wikimedia.org ? For testin...
[22:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210308T2200).
[22:01:46] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (RobH)
[22:09:56] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn) > not like gitlab-test.wikimedia.org or gitlab-beta.wikimedia.org ? > > Yeah, I think so. Users (and other consumers of data, like bots) are going to have git remotes set for long-ter...
[22:12:15] (CR) Legoktm: admin: create new admin group mailman3-roots (2 comments) [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712) (owner: Dzahn)
[22:12:22] (PS1) RobH: updating an-druid1005 [puppet] - https://gerrit.wikimedia.org/r/669995 (https://phabricator.wikimedia.org/T274163)
[22:12:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:12:57] (CR) RobH: [C: +2] updating an-druid1005 [puppet] - https://gerrit.wikimedia.org/r/669995 (https://phabricator.wikimedia.org/T274163) (owner: RobH)
[22:15:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:15:32] (CR) Bstorm: [C: +2] wikireplicas: expose actor_user = NULL (IPs) again in actor view [puppet] - https://gerrit.wikimedia.org/r/669888 (https://phabricator.wikimedia.org/T276698) (owner: Bstorm)
[22:18:15] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-druid1005.eqiad.wmnet ` The log can be found in `/var...
[22:19:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:22:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:25:11] (CR) Dzahn: admin: create new admin group mailman3-roots (2 comments) [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712) (owner: Dzahn)
[22:33:28] (PS2) Dzahn: admin: create new admin group mailman3-roots [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712)
[22:34:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1005.eqiad.wmnet with reason: REIMAGE
[22:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1005.eqiad.wmnet with reason: REIMAGE
[22:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:27] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:40:58] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (brennen) Yeah, good question - sorry I conflated machine-specific with a `-test` / `-beta` hostname in my response. I //think// `gitlab.wikimedia.org` is good, on my current understanding tha...
[22:44:54] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1005.eqiad.wmnet'] ` and were **ALL** successful.
[22:47:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:49:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:51:55] (CR) Legoktm: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712) (owner: Dzahn)
[22:53:42] (CR) Dzahn: [C: +2] admin: create new admin group mailman3-roots [puppet] - https://gerrit.wikimedia.org/r/669988 (https://phabricator.wikimedia.org/T276712) (owner: Dzahn)
[22:56:20] SRE, SRE-Access-Requests, Wikimedia-Mailing-lists, Patch-For-Review: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (Dzahn) Open→Resolved a:Dzahn The group has been created with gid 827 and is automatically applied where the `lists3` puppet r...
[22:56:31] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Dzahn)
[22:58:26] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (RobH)
[22:59:49] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (RobH) a:RobH→Cmjohnson [[ https://netbox.wikimedia.org/dcim/devices/3046/ | an-druid1004.mgmt.eqiad.wmnet ]] is not responsive. Please double check th...
[23:07:59] (PS1) Jdlrobson: Logo updates [mediawiki-config] - https://gerrit.wikimedia.org/r/669998 (https://phabricator.wikimedia.org/T273085)
[23:08:01] (PS1) Jdlrobson: Enable modern Vector on incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479)
[23:09:20] (CR) jerkins-bot: [V: -1] Logo updates [mediawiki-config] - https://gerrit.wikimedia.org/r/669998 (https://phabricator.wikimedia.org/T273085) (owner: Jdlrobson)
[23:09:34] (CR) jerkins-bot: [V: -1] Enable modern Vector on incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479) (owner: Jdlrobson)
[23:17:56] (CR) Bstorm: "Is there a way to ensure this is idempotent in puppet? I *looks* potentially destructive or thrashy here." [puppet] - https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott)
[23:17:59] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Sergey.Trofimovsky.SF) >>! In T276170#6894358, @brennen wrote: > Yeah, good question - sorry I conflated machine-specific with a `-test` / `-beta` hostname in my response. > > I //think// `gi...
[23:20:06] (PS2) Jdlrobson: Logo updates [mediawiki-config] - https://gerrit.wikimedia.org/r/669998 (https://phabricator.wikimedia.org/T273085)
[23:20:14] (PS2) Jdlrobson: Enable modern Vector on incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479)
[23:21:27] (CR) jerkins-bot: [V: -1] Enable modern Vector on incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479) (owner: Jdlrobson)
[23:35:40] PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - /srv/docker/containers/4686899adb7a2b51abe85abece0d28ee8b00213700f71c70ae98685f5a6ef2ba/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops
[23:36:50] ^ me running a shell via wmfdebug image
[23:37:09] (PS3) Jdlrobson: Enable modern Vector on incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479)
[23:38:09] looks like we'll need to exclude container mountpoints from monitoring
[23:40:47] I dont know why we would need to monitor them anyway
[23:47:11] (PS1) Bstorm: wikireplicas: depool labsdb1010 for view changes [puppet] - https://gerrit.wikimedia.org/r/670008 (https://phabricator.wikimedia.org/T276698)
[23:48:44] (CR) Bstorm: [C: +2] "I'm pretty confident that I'm the only one messing with this host right now, so I'll merge this." [puppet] - https://gerrit.wikimedia.org/r/670008 (https://phabricator.wikimedia.org/T276698) (owner: Bstorm)
[23:51:15] (PS1) Bstorm: Revert "wikireplicas: depool labsdb1010 for view changes" [puppet] - https://gerrit.wikimedia.org/r/669797
[23:57:04] RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops
[23:57:33] (CR) Bstorm: [C: +2] Revert "wikireplicas: depool labsdb1010 for view changes" [puppet] - https://gerrit.wikimedia.org/r/669797 (owner: Bstorm)