[00:06:04] (CR) Dzahn: "> adding changes for a cloud-specific service that could easily be used to leak production log data with a relatively small configuration" [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[00:09:12] (CR) Bstorm: "I think this is a good setup. The base sentinel setup pieces might be a good thing to move into the module for ::redis as something like b" [puppet] - https://gerrit.wikimedia.org/r/690528 (https://phabricator.wikimedia.org/T153810) (owner: Majavah)
[00:13:39] (PS2) Dzahn: add test variant to match test pipeline, add httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/691283
[00:13:57] (CR) jerkins-bot: [V: -1] add test variant to match test pipeline, add httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/691283 (owner: Dzahn)
[00:37:21] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 55.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:38:33] (Abandoned) Dzahn: add test variant to match test pipeline, add httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/691283 (owner: Dzahn)
[00:38:45] (PS2) Dzahn: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538)
[00:39:01] (CR) jerkins-bot: [V: -1] initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[00:39:15] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop 
in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[00:39:45] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 82.79 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:41:37] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[00:43:38] (PS1) Ahmon Dancy: Set final notify email addresses [mediawiki-config] - https://gerrit.wikimedia.org/r/692741
[00:51:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:01] (CR) Dzahn: "focused on the "not found in wmf-stable index" part and then "grep -r wmf-stable *"ed the repo and I notice that all the other services ha" [deployment-charts] - https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[00:54:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:56:55] (CR) Dzahn: "duh, of course you are adding just that in this very patch" [deployment-charts] - https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[01:02:04] (CR) Thcipriani: [C: +2] Set final notify email addresses [mediawiki-config] - https://gerrit.wikimedia.org/r/692741 (owner: Ahmon Dancy)
[01:02:47] (Merged) jenkins-bot: Set final notify email addresses [mediawiki-config] - https://gerrit.wikimedia.org/r/692741 (owner: Ahmon Dancy)
[01:03:50] ^ fetched on deployment host but not sync'd since it's a CI thing, not a wikiprod thing
[01:07:13] (PS3) Dzahn: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538)
[01:07:47] (CR) jerkins-bot: [V: -1] initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:11:59] (PS4) Dzahn: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538)
[01:12:30] (CR) jerkins-bot: [V: -1] initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:35:47] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:40:07] (CR) Reedy: [C: +2] Fix call to non-existent var [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/692655 
(https://phabricator.wikimedia.org/T283098) (owner: Zabe)
[02:40:10] (CR) Reedy: [C: +2] Fix call to non-existent var [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/692654 (https://phabricator.wikimedia.org/T283098) (owner: Zabe)
[02:57:31] (Merged) jenkins-bot: Fix call to non-existent var [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/692655 (https://phabricator.wikimedia.org/T283098) (owner: Zabe)
[02:58:28] (Merged) jenkins-bot: Fix call to non-existent var [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/692654 (https://phabricator.wikimedia.org/T283098) (owner: Zabe)
[03:01:47] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.6/includes/changetags/ChangeTagsRevisionList.php: T283098 T283099 (duration: 02m 35s)
[03:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:52] T283098: PHP Notice: Undefined property: ChangeTagsRevisionList::$page - https://phabricator.wikimedia.org/T283098
[03:01:53] T283099: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T283099
[03:03:10] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.5/includes/changetags/ChangeTagsRevisionList.php: T283098 T283099 (duration: 01m 05s)
[03:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:23:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:30:55] Puppet, GitLab (Initialization), Patch-For-Review: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (Reedy)
[03:49:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:51:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:58:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109', diff saved to https://phabricator.wikimedia.org/P16075 and previous config saved to /var/cache/conftool/dbconfig/20210519-045857-marostegui.json
[04:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:27] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:50] (PS1) Marostegui: Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/692656
[05:17:07] !log Compress a few tables on s3 T283125
[05:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:11] T283125: dbstore1004 85% disk space used. 
- https://phabricator.wikimedia.org/T283125
[05:24:40] (PS1) Marostegui: mariadb: Decommission labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/692754 (https://phabricator.wikimedia.org/T282523)
[05:31:12] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:43:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: Repool db1109', diff saved to https://phabricator.wikimedia.org/P16076 and previous config saved to /var/cache/conftool/dbconfig/20210519-054313-root.json
[05:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:22] SRE, Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (Ladsgroup) And also please don't do a wild ban, that would have worked in mm2 but not in mm3
[05:51:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P16077 and previous config saved to /var/cache/conftool/dbconfig/20210519-055134-marostegui.json
[05:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: Repool db1109', diff saved to https://phabricator.wikimedia.org/P16078 and previous config saved to /var/cache/conftool/dbconfig/20210519-055817-root.json
[05:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:04] legoktm and Marostegui: That opportune time is upon us again. Time for a Mailman schema change deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T0600). 
[06:00:15] \o/
[06:00:20] yay
[06:00:33] ok, stopping mm3 service now
[06:01:22] !log stopped mailman3 service on lists1001 for schema change
[06:01:23] cool, let me know when I can proceed
[06:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:27] marostegui: your turn :)
[06:02:08] legoktm: mailman3 database right?
[06:02:11] yes
[06:02:19] ok, deploying
[06:02:24] done
[06:02:29] all of them?
[06:02:35] that was...fast
[06:02:35] yes, let me double check
[06:03:33] looks good to me
[06:04:05] going to turn mailman3 back on now
[06:04:09] yeah
[06:04:11] looks good too
[06:04:26] SRE, DBA, Wikimedia-Mailing-lists, Schema-change, User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (Marostegui) Alters deployed
[06:04:46] !log restarted mailman3 on lists1001
[06:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:02] <3
[06:05:16] I'll soon start migrating daily-article
[06:06:03] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:06:45] marostegui: thank you :))
[06:06:59] <3
[06:07:40] SRE, Wikimedia-Mailing-lists, Upstream: daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name - https://phabricator.wikimedia.org/T282271 (Legoktm)
[06:08:10] SRE, DBA, Wikimedia-Mailing-lists, Schema-change, User-notice: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (Legoktm) Open→Resolved Yay, thank you! In conclusion the schema changes themselves took a few seconds and we had about 3 m... 
[06:08:33] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:13:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: Repool db1109', diff saved to https://phabricator.wikimedia.org/P16079 and previous config saved to /var/cache/conftool/dbconfig/20210519-061321-root.json
[06:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:24] (Abandoned) Elukey: role::redis::misc::master: increase maxmemory for ORES instances [puppet] - https://gerrit.wikimedia.org/r/671186 (owner: Elukey)
[06:14:32] (Abandoned) Elukey: rsync::server::module: add check for secrets_file [puppet] - https://gerrit.wikimedia.org/r/514750 (owner: Elukey)
[06:14:41] (Abandoned) Elukey: archiva::proxy: raise TLS ciphersuite requirements [puppet] - https://gerrit.wikimedia.org/r/604698 (https://phabricator.wikimedia.org/T252767) (owner: Elukey)
[06:16:07] (CR) Elukey: [C: +2] "Makes sense, merging :)" [puppet] - https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: Ottomata)
[06:18:29] !log upgrading daily-article-l to mailman3 (T282271 T280322)
[06:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:33] T282271: daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name - https://phabricator.wikimedia.org/T282271
[06:18:33] T280322: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322
[06:21:05] (CR) Elukey: [C: +1] "LGTM!" 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) (owner: Ottomata)
[06:24:25] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:25:00] sigh, the bounce runner crashed
[06:25:27] "Lock wait timeout exceeded; try restarting transaction"
[06:25:40] Amir1: once you finish the import, can you run systemctl restart mailman3?
[06:26:24] :(
[06:26:25] Sure
[06:26:36] it'll take some time though
[06:28:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: Repool db1109', diff saved to https://phabricator.wikimedia.org/P16080 and previous config saved to /var/cache/conftool/dbconfig/20210519-062824-root.json
[06:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:37] (CR) Elukey: "Rolled out via cumin, a long list of no ops!" 
[puppet] - https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: Ottomata)
[06:33:37] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:33:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1167', diff saved to https://phabricator.wikimedia.org/P16081 and previous config saved to /var/cache/conftool/dbconfig/20210519-063345-marostegui.json
[06:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts labsdb1010.eqiad.wmnet
[06:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:32] (CR) Marostegui: [C: +2] mariadb: Decommission labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/692754 (https://phabricator.wikimedia.org/T282523) (owner: Marostegui)
[06:38:16] (PS1) Marostegui: maintain_dbusers.pp: Remove labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/692757 (https://phabricator.wikimedia.org/T282662)
[06:39:56] (CR) Marostegui: [C: +2] maintain_dbusers.pp: Remove labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/692757 (https://phabricator.wikimedia.org/T282662) (owner: Marostegui)
[06:42:26] (CR) Marostegui: [C: +2] Revert "db1106: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/692656 (owner: Marostegui)
[06:42:47] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 35 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:43:18] I'll ack ^
[06:43:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106 T280492', diff saved to https://phabricator.wikimedia.org/P16082 and previous config saved to /var/cache/conftool/dbconfig/20210519-064343-marostegui.json
[06:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:48] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492
[06:44:27] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:46:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labsdb1010.eqiad.wmnet
[06:46:18] ACKNOWLEDGEMENT - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 35 (limit: 25) Legoktm runner crashed, will restart later https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:25] ACKNOWLEDGEMENT - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner Legoktm runner crashed, will restart later https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:47:53] ops-eqiad, Data-Services, decommission-hardware: decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (Marostegui)
[06:49:04] (CR) Elukey: "Did a quick pass and left some little comments, I think this is definitely valuable and worth to review." 
(3 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/644050 (owner: Ladsgroup)
[06:49:19] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:53:51] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P16083 and previous config saved to /var/cache/conftool/dbconfig/20210519-070019-root.json
[07:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:04] (PS2) Jcrespo: dbbackups: Remove s6 stretch backup source instance on eqiad [puppet] - https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751)
[07:05:06] (PS1) Jcrespo: dbbackups: Switchover s3 backup source from db1171 to db1102 (buster) [puppet] - https://gerrit.wikimedia.org/r/692845 (https://phabricator.wikimedia.org/T283131)
[07:07:01] (CR) Ladsgroup: Move tests to a proper directory structure (3 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/644050 (owner: Ladsgroup)
[07:07:04] (PS2) Matthias Mullie: Properly enable media change tags on Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) (owner: Urbanecm)
[07:07:15] (PS5) Ladsgroup: Move tests to a proper directory structure [debs/pybal] - https://gerrit.wikimedia.org/r/644050
[07:15:24] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1167 (re)pooling @ 50%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P16084 and previous config saved to /var/cache/conftool/dbconfig/20210519-071523-root.json
[07:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:49] gosh this mailing list list so massive, I can't do anything about it
[07:17:59] (CR) Muehlenhoff: [C: +1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/692690 (owner: Volans)
[07:20:22] (CR) Muehlenhoff: [C: +1] "Looks good, two typos inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/692703 (owner: Volans)
[07:21:02] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-3h&to=now&var-server=lists1001&var-datasource=thanos&var-cluster=misc
[07:21:26] or https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1128&var-port=9104&var-dc=eqiad%20prometheus%2Fops
[07:21:38] it failed. I want to try again but it can't delete the new mailing list
[07:21:39] ugh
[07:21:45] why'd it fail??
[07:22:20] nothing emoji related AFAICS
[07:22:29] :v
[07:22:41] I slowly drain members so I can delete it
[07:22:51] directly from db
[07:23:00] what was the error? 
[07:23:18] but yeah, deleting it that way makes sense
[07:23:54] transaction took too long
[07:24:07] but I hope this doesn't happen
[07:24:46] 34,818 members
[07:25:07] (checked in mm2)
[07:25:09] (CR) Muehlenhoff: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/692703 also addresses the expiry date, but I guess additional YAML schema validation " [puppet] - https://gerrit.wikimedia.org/r/692643 (owner: RLazarus)
[07:30:22] okay finally deleted, redoing it
[07:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P16085 and previous config saved to /var/cache/conftool/dbconfig/20210519-073027-root.json
[07:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:44] !log Deploy schema change on s3 codfw, lag will appear in codfw T266486 T268392 T273360
[07:31:47] (PS2) Volans: admin: add additional validation tests [puppet] - https://gerrit.wikimedia.org/r/692703
[07:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:49] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392
[07:31:50] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360
[07:31:50] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486
[07:32:02] (CR) Volans: "addressed comments, thanks for the review" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/692703 (owner: Volans)
[07:32:05] restarted mm3 now I think the bounce processor will crash again
[07:32:11] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:32:31] ty and yeah it probably 
will
[07:32:48] not the end of the world
[07:33:45] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:36:11] Amir1: in theory commenting out https://gitlab.com/mailman/mailman/-/blob/3.3.3/src/mailman/commands/cli_import.py#L73 will have it not use a single transaction for the import
[07:36:33] let's try it if breaks again
[07:36:59] I didn't actually check if other code later would create its own transaction
[07:37:09] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:33] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:51] !log add 100G to prometheus/ops eqiad
[07:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:19] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 57 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[07:41:29] (CR) Ayounsi: [C: +2] SNMP: filter out default logical interfaces (.0) [homer/public] - https://gerrit.wikimedia.org/r/692576 (https://phabricator.wikimedia.org/T283060) (owner: Ayounsi)
[07:42:18] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/692703 (owner: Volans)
[07:42:19] PROBLEM - mailman3_runners on 
lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:42:24] !log roll SNMP: filter out default logical interfaces (.0) to all network devices - T283060
[07:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:28] T283060: SNMP: filter out default sub interfaces - https://phabricator.wikimedia.org/T283060
[07:42:59] (CR) Filippo Giunchedi: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: Filippo Giunchedi)
[07:43:04] (CR) Volans: [C: +1] "I didn't test it but seems good to me." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/692643 (owner: RLazarus)
[07:43:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:19] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:46] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (JoKalliauer) @Glrx sorry I missed it earlier. #### bash-input ` #!/bin/bash rsvg-convert --version LANG=de rsvg-convert -w 512 -h 224 -o result-de.png SystemLanguage.svg LA... 
[07:45:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repool db1167', diff saved to https://phabricator.wikimedia.org/P16086 and previous config saved to /var/cache/conftool/dbconfig/20210519-074530-root.json
[07:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:57] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@5740956]: T273847 deploying export_queries_to_relforge - index setting changes
[07:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:01] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847
[07:46:02] (CR) Filippo Giunchedi: [C: +2] rsync: move quickdatacopy to systemd timer [puppet] - https://gerrit.wikimedia.org/r/692587 (https://phabricator.wikimedia.org/T273673) (owner: Filippo Giunchedi)
[07:48:21] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@5740956]: T273847 deploying export_queries_to_relforge - index setting changes (duration: 02m 23s)
[07:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:35] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/692643 (owner: RLazarus)
[08:09:01] (CR) Muehlenhoff: [C: +1] "Looks good. 
The UI still looks the same with the updated libs/CSS and I guess that's what we want :-)" [software/debmonitor] - https://gerrit.wikimedia.org/r/692691 (owner: Volans)
[08:10:54] !log zpapierski@deploy1002 Started deploy [wikimedia/discovery/analytics@f514dd9]: T273847 deploying export_queries_to_relforge - starttime bump
[08:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:00] T273847: Create a elasticsearch/kibana index with queries to allow query completion candidate research - https://phabricator.wikimedia.org/T273847
[08:12:58] (CR) Muehlenhoff: [C: +1] "Looks good and works fine on the test instance" [software/debmonitor] - https://gerrit.wikimedia.org/r/692692 (owner: Volans)
[08:13:19] !log zpapierski@deploy1002 Finished deploy [wikimedia/discovery/analytics@f514dd9]: T273847 deploying export_queries_to_relforge - starttime bump (duration: 02m 24s)
[08:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:12] (PS3) Jbond: admin: drop christinedk do not merge before 17/05/2021 [puppet] - https://gerrit.wikimedia.org/r/692073
[08:19:05] (CR) Jbond: [C: +2] admin: drop christinedk do not merge before 17/05/2021 [puppet] - https://gerrit.wikimedia.org/r/692073 (owner: Jbond)
[08:20:11] SRE, Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (Peachey88) @Olauro See https://meta.wikimedia.org/wiki/Mailing_lists/Mailman3_migration#Review_bans to expand on Ladsgroup comment on that regex method n... 
[08:21:51] (CR) Filippo Giunchedi: [C: +2] pontoon: add bootstrap and provision scripts [puppet] - https://gerrit.wikimedia.org/r/691231 (owner: Filippo Giunchedi)
[08:22:13] jbond42: merged your change too
[08:22:20] oh yes lease :)
[08:26:10] {{done}}
[08:27:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P16087 and previous config saved to /var/cache/conftool/dbconfig/20210519-082713-marostegui.json
[08:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:16] SRE, serviceops: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (Joe) For the record, I considered using the "official" dockerhub debian images as a base, but: - It's not an official debian effort (for now) - The build process includes downloading artifacts from...
[08:28:18] !log Stop MySQL on db1175 to upgrade kernel and mysql
[08:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:29] (CR) Jbond: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/692690 (owner: Volans)
[08:34:02] (CR) Jbond: [C: +1] "lgtm" [software/debmonitor] - https://gerrit.wikimedia.org/r/692691 (owner: Volans)
[08:34:18] (CR) Volans: [C: +2] admin: add additional validation tests [puppet] - https://gerrit.wikimedia.org/r/692703 (owner: Volans)
[08:34:43] (CR) Jbond: [C: +1] "lgtm" [software/debmonitor] - https://gerrit.wikimedia.org/r/692692 (owner: Volans)
[08:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16088 and previous config saved to /var/cache/conftool/dbconfig/20210519-084119-root.json
[08:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: 
job=jmx_puppetdb site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:53:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:56:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16089 and previous config saved to /var/cache/conftool/dbconfig/20210519-085622-root.json [08:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:48] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name - https://phabricator.wikimedia.org/T282271 (10Ladsgroup) 05Open→03Resolved a:03Legoktm Migrated like a charm [08:56:59] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) [09:04:55] (03PS2) 10Kormat: switchover: Use heartbeat systemd service. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665337 (https://phabricator.wikimedia.org/T252528) [09:05:58] (03PS9) 10Kormat: mariadb: Convert pt-heartbeat to a systemd service. [puppet] - 10https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) [09:07:31] (03CR) 10jerkins-bot: [V: 04-1] switchover: Use heartbeat systemd service. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665337 (https://phabricator.wikimedia.org/T252528) (owner: 10Kormat) [09:11:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16090 and previous config saved to /var/cache/conftool/dbconfig/20210519-091126-root.json [09:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Add ingress-nginx Helm files [puppet] - 10https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [09:12:43] (03CR) 10Ladsgroup: [C: 03+1] "Generally looks good" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690680 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [09:13:36] (03CR) 10Ladsgroup: [C: 03+1] Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692452 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [09:13:47] (03PS1) 10Jbond: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) [09:14:34] (03CR) 10jerkins-bot: [V: 04-1] O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [09:18:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:26:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16091 and previous config saved to /var/cache/conftool/dbconfig/20210519-092630-root.json [09:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:47] (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [09:26:53] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [09:35:17] (03CR) 10Arturo Borrero Gonzalez: "I'm having problems understanding this change, specifically:" [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [09:35:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (sans the nits already pointed out by Luca)" [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [09:36:22] (03CR) 10Volans: [C: 03+2] setup.py: upgrade dependencies [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692690 (owner: 10Volans) [09:36:30] (03CR) 10Volans: [C: 03+2] static: update CSS and JS libraries [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692691 (owner: 10Volans) [09:36:37] (03CR) 10Volans: [C: 03+2] config: improve CSP headers [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692692 (owner: 10Volans) [09:37:37] PROBLEM - DNS on labsdb1010.mgmt is CRITICAL: Domain labsdb1010.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:38:03] ^ this host was decommissioned earlier today [09:39:05] (03Merged) 
10jenkins-bot: setup.py: upgrade dependencies [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692690 (owner: 10Volans) [09:39:07] (03Merged) 10jenkins-bot: static: update CSS and JS libraries [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692691 (owner: 10Volans) [09:39:09] (03Merged) 10jenkins-bot: config: improve CSP headers [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692692 (owner: 10Volans) [09:39:34] marostegui: but should still be reachable on the mgmt iface right? [09:39:58] volans: this is the first time I see that after running the decom script [09:40:13] checking [09:41:14] volans: from what I can see in the output of the script, there were no errors or anything [09:41:25] (03PS1) 10Kormat: setup.py: Don't use latest cumin for py3.5 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/692857 [09:41:27] it removed labsdb1010 and labsdb1010.mgmt.eqiad.wmnet. dns entries [09:41:28] marostegui: puppet disabled on alert1010 1162 minutes ago) [09:41:34] :-/ [09:41:53] no message [09:42:27] that's lots of hours to keep puppet disbaled [09:42:28] disabled [09:42:38] indeed [09:42:49] :-) nice mystery for a wednesday [09:42:51] also no message == not using the disable-puppet script as it should be done [09:43:31] (03Abandoned) 10MSantos: maps imposm3: add log file for imposm3 sync [puppet] - 10https://gerrit.wikimedia.org/r/670817 (owner: 10MSantos) [09:44:31] (03PS2) 10Jbond: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) [09:44:33] (03PS1) 10Jbond: standard: drop metadata.yaml [puppet] - 10https://gerrit.wikimedia.org/r/692858 [09:45:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29620/console" [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [09:47:34] 
(03CR) 10Muehlenhoff: [C: 03+2] Enable SLO for grafana [puppet] - 10https://gerrit.wikimedia.org/r/692605 (owner: 10Muehlenhoff) [09:47:59] arturo, marostegui: my bad, it's not disabled (I read too quickly), it's failing since yesterday [09:48:12] https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/8e3a9231bf1d18aa9d320171eb3b231e2bc0b6af [09:49:01] mystery solved then! [09:51:23] (03PS1) 10Volans: prometheus: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/692860 (https://phabricator.wikimedia.org/T276749) [09:51:34] godog: ^^^ [09:52:02] (03CR) 10Volans: "According to https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/8e3a9231bf1d18aa9d320171eb3b231e2bc0b6af this should be the " [puppet] - 10https://gerrit.wikimedia.org/r/692860 (https://phabricator.wikimedia.org/T276749) (owner: 10Volans) [09:52:08] gah, thanks volans [09:52:19] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/692860 (https://phabricator.wikimedia.org/T276749) (owner: 10Volans) [09:53:01] (03CR) 10Volans: [C: 03+2] prometheus: fix syntax error [puppet] - 10https://gerrit.wikimedia.org/r/692860 (https://phabricator.wikimedia.org/T276749) (owner: 10Volans) [09:53:42] merged, running puppet on alert1001 [09:55:27] SGTM [09:55:48] (03CR) 10Muehlenhoff: "Logout works fine, the error messages are a little irritating (NetworkError when attempting to fetch resource when accessing a dashboard a" [puppet] - 10https://gerrit.wikimedia.org/r/692605 (owner: 10Muehlenhoff) [09:56:11] all good, lot of changes ofc [09:56:21] (03CR) 10Alexandros Kosiaris: "Completely different error this time around, so we at least hotfixed the other one." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [09:56:26] in the icinga configs [09:57:06] I think that the alert hosts shouldalert if puppet broken and be an exception of the aggregated puppet failed alert [09:58:49] I'm doubtful it'll help but I'm not opposed, I'm assuming you mean an exception to "puppet last run" alert ? [09:59:18] yes that one [09:59:18] or yeah the aggregated alert too perhaps [09:59:44] I think they are a bit of a special host in that sense [09:59:48] being monitoring all the rest and such [10:00:11] 10SRE, 10Analytics, 10Analytics-Kanban: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10Aklapper) [10:00:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment inline, rest LGTM" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [10:00:29] 10SRE, 10Scap (Scap3-MediaWiki-MVP): Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629 (10Aklapper) [10:00:45] 10SRE, 10Scap (Scap3-MediaWiki-MVP): Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 (10Aklapper) 05Open→03Resolved No reply to last comments; assuming this is resolved [10:01:29] yeah I see what you are saying re: alert hosts [10:01:30] (03CR) 10Muehlenhoff: [C: 03+2] Enable SLO by default [puppet] - 10https://gerrit.wikimedia.org/r/692607 (owner: 10Muehlenhoff) [10:04:19] godog: should I open a task with puppet and o11y tags? 
[10:04:33] so it can be discussed more easily there [10:05:46] volans: sure, thank you [10:05:50] ack, doing [10:07:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] (WIP) Add tokens and users for maps-vector-server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692669 (owner: 10Effie Mouzeli) [10:08:48] 10Puppet, 10observability: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (10Volans) p:05Triage→03Medium [10:09:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll merge this and submit the followup commit for making this smarter based on the type of the resource for review later." [puppet] - 10https://gerrit.wikimedia.org/r/691108 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [10:09:53] (03PS2) 10Alexandros Kosiaris: docker-registry: Re-apply Cache-Control rules [puppet] - 10https://gerrit.wikimedia.org/r/691108 (https://phabricator.wikimedia.org/T256762) [10:11:27] (03CR) 10Kormat: [C: 03+2] setup.py: Don't use latest cumin for py3.5 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/692857 (owner: 10Kormat) [10:11:37] (03PS1) 10Jbond: cloud puppet-dev: add hiera config so puppet failes after first run [puppet] - 10https://gerrit.wikimedia.org/r/692861 [10:12:12] (03CR) 10Jbond: [C: 03+2] cloud puppet-dev: add hiera config so puppet failes after first run [puppet] - 10https://gerrit.wikimedia.org/r/692861 (owner: 10Jbond) [10:13:52] (03Merged) 10jenkins-bot: setup.py: Don't use latest cumin for py3.5 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/692857 (owner: 10Kormat) [10:17:05] akosiaris: forget to press 'y' on puppet-merge ? [10:17:31] (03PS3) 10Kormat: switchover: Use heartbeat systemd service. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665337 (https://phabricator.wikimedia.org/T252528) [10:20:59] (03CR) 10Kormat: [C: 03+2] "After a bunch of testing in pontoon, this seems to work fine." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665337 (https://phabricator.wikimedia.org/T252528) (owner: 10Kormat) [10:23:16] (03Merged) 10jenkins-bot: switchover: Use heartbeat systemd service. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665337 (https://phabricator.wikimedia.org/T252528) (owner: 10Kormat) [10:31:43] jbond42: yup, fixed, thanks [10:32:00] thanks [10:32:36] (03CR) 10Muehlenhoff: "- alertmanager bails out correctly, although the error message is a little irritating: "Can't connect to the API, last error was "JSON.par" [puppet] - 10https://gerrit.wikimedia.org/r/692607 (owner: 10Muehlenhoff) [10:32:38] 10SRE, 10netops: SNMP: filter out default sub interfaces - https://phabricator.wikimedia.org/T283060 (10ayounsi) 05Open→03Resolved a:03ayounsi Pushed everywhere successfully. ~2400 LibreNMS ports removed. Speed up LibreNMS pooling by ~25%. https://librenms.wikimedia.org/graphs/type=global_poller_modules... [10:33:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Not in love with the naming, but if it's provisional, fine by me" [puppet] - 10https://gerrit.wikimedia.org/r/692667 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [10:33:36] PROBLEM - MariaDB memory on db1148 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1196) = 92.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:34:18] (03PS1) 10Jbond: hiera: remove borked config [puppet] - 10https://gerrit.wikimedia.org/r/692863 [10:35:03] (03CR) 10Jbond: [C: 03+2] hiera: remove borked config [puppet] - 10https://gerrit.wikimedia.org/r/692863 (owner: 10Jbond) [10:35:59] (03PS9) 10Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) [10:36:07] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez) [10:36:19] ACKNOWLEDGEMENT - MariaDB memory on db1148 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1196) = 92.9% Marostegui Long running maintenance query running https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:37:07] (03CR) 10Hnowlan: [C: 03+1] "LGTM, one question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685743 (owner: 10MSantos) [10:37:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Allow passing host port for k8s ingress [puppet] - 10https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [10:41:28] (03CR) 10Jbond: [C: 03+2] standard: drop metadata.yaml [puppet] - 10https://gerrit.wikimedia.org/r/692858 (owner: 10Jbond) [10:44:03] (03PS3) 10Jbond: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) [10:45:34] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/692605 (owner: 10Muehlenhoff) [10:48:02] (03CR) 10Muehlenhoff: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [10:48:29] (03PS1) 10Effie 
Mouzeli: (WIP) Add tokens for maps-vector-server [labs/private] - 10https://gerrit.wikimedia.org/r/692865 [10:48:33] (03PS1) 10Filippo Giunchedi: openstack: add bullseye VM clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/692866 (https://phabricator.wikimedia.org/T280801) [10:48:55] (03CR) 10Filippo Giunchedi: "I ran into this missing class when trying bullseye in eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/692866 (https://phabricator.wikimedia.org/T280801) (owner: 10Filippo Giunchedi) [10:48:57] (03PS2) 10Effie Mouzeli: Add tokens for mwdebug service [labs/private] - 10https://gerrit.wikimedia.org/r/692672 (https://phabricator.wikimedia.org/T283056) [10:52:01] (03CR) 10Muehlenhoff: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [10:54:42] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) [10:58:35] (03PS1) 10MMandere: Add drmrs site instances [puppet] - 10https://gerrit.wikimedia.org/r/692869 (https://phabricator.wikimedia.org/T282787) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I ♥ Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1100). [11:00:05] matthiasmullie and kostajh: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:25] \o [11:00:31] o/ [11:04:11] (03PS2) 10Giuseppe Lavagetto: Add helmfile.d for shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [11:04:12] (03PS1) 10Giuseppe Lavagetto: Rakefile: correctly handle added deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/692870 [11:04:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10ayounsi) After chatting with @wiki_willy the best use for those switches is to go in the C8 and D5, as cloudsw2 switches to add capacity to the existing cloudsw switches. Exact... [11:05:29] I can deploy mine & kostajh's patches [11:05:54] matthiasmullie: thanks! [11:06:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Rakefile: correctly handle added deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/692870 (owner: 10Giuseppe Lavagetto) [11:07:19] (03CR) 10Matthias Mullie: [C: 03+2] Properly enable media change tags on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm) [11:08:05] (03Merged) 10jenkins-bot: Properly enable media change tags on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690691 (https://phabricator.wikimedia.org/T266067) (owner: 10Urbanecm) [11:08:52] (03CR) 10Jbond: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [11:11:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This should be good to merge. But I wonder what is still using ussuri out there. There should be some pending hiera cleanup." 
[puppet] - 10https://gerrit.wikimedia.org/r/692866 (https://phabricator.wikimedia.org/T280801) (owner: 10Filippo Giunchedi) [11:13:13] !log mlitn@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:690691|Properly enable media change tags on Wikipedias (T266067 T282822)]] - part 1 (duration: 01m 34s) [11:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:18] T282822: Certain tags are no longer activated by default - https://phabricator.wikimedia.org/T282822 [11:13:18] T266067: [L] Create edit tags to measure multimedia edits to Wikipedia articles - https://phabricator.wikimedia.org/T266067 [11:14:46] !log mlitn@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:690691|Properly enable media change tags on Wikipedias (T266067 T282822)]] - part 2 (duration: 01m 04s) [11:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:05] (03CR) 10Matthias Mullie: [C: 03+2] Add a link: Set contentedtiable=false on mobile [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692653 (https://phabricator.wikimedia.org/T281771) (owner: 10Kosta Harlan) [11:22:17] (03CR) 10MSantos: WIP: maps: DB performance improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685743 (owner: 10MSantos) [11:22:35] (03CR) 10Muehlenhoff: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [11:25:05] (03CR) 10Jbond: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692852 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [11:25:15] (03Abandoned) 10Jbond: O:standard::manifest::ntp::timesync install systemd-timesyncd on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/692852 
(https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [11:29:02] (03PS1) 10Volans: tests: drop Python2 support code from cli tests [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692871 [11:29:03] (03PS1) 10Volans: cli: drop support for Python 3.4 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692872 [11:29:05] (03PS1) 10Volans: setup.py: add explicit support for Python 3.9 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692873 [11:29:07] (03PS1) 10Volans: cli: bump version to 0.3 to follow the server side [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 [11:29:53] (03PS1) 10Majavah: toolforge: Update helm values after chart changes [puppet] - 10https://gerrit.wikimedia.org/r/692875 (https://phabricator.wikimedia.org/T264221) [11:31:43] (03PS1) 10Marostegui: install_server: Remove db1087 entry [puppet] - 10https://gerrit.wikimedia.org/r/692876 (https://phabricator.wikimedia.org/T282093) [11:32:17] (03Merged) 10jenkins-bot: Add a link: Set contentedtiable=false on mobile [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/692653 (https://phabricator.wikimedia.org/T281771) (owner: 10Kosta Harlan) [11:32:26] (03CR) 10Marostegui: [C: 03+2] install_server: Remove db1087 entry [puppet] - 10https://gerrit.wikimedia.org/r/692876 (https://phabricator.wikimedia.org/T282093) (owner: 10Marostegui) [11:34:08] (03CR) 10Volans: "I was planning to put a 0.3.0 tag on the repo for the server side, as we've upgraded a lot of dependencies and dropped support from the cl" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 (owner: 10Volans) [11:36:01] kostajh: patch is on mwdebug1002 [11:36:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/692634 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [11:36:45] matthiasmullie: thanks, having a look [11:36:50] (03PS1) 10Jbond: chrony: drop chrony [puppet] - 
10https://gerrit.wikimedia.org/r/692877 (https://phabricator.wikimedia.org/T280801) [11:37:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Update helm values after chart changes [puppet] - 10https://gerrit.wikimedia.org/r/692875 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [11:37:44] (03PS2) 10Jbond: chrony: drop chrony [puppet] - 10https://gerrit.wikimedia.org/r/692877 (https://phabricator.wikimedia.org/T280801) [11:40:07] matthiasmullie: looks good [11:40:56] kostajh: syncing [11:41:52] !log mlitn@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/GrowthExperiments/modules: Backport: [[gerrit:692653|Add a link: Set contentedtiable=false on mobile (T281771)]] (duration: 01m 06s) [11:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:56] T281771: Prevent contenteditabe like experiences (cursor placement; menu selection for cutting text) in AddLinkArticleTarget - https://phabricator.wikimedia.org/T281771 [11:41:56] (03PS1) 10Kormat: Prepare for 0.7 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/692878 [11:42:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175', diff saved to https://phabricator.wikimedia.org/P16093 and previous config saved to /var/cache/conftool/dbconfig/20210519-114203-marostegui.json [11:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:13] thank you matthiasmullie [11:45:29] Alright I guess we're done here [11:45:34] !log "EU backports done" [11:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:38] kostajh: yw! [11:47:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] chrony: drop chrony [puppet] - 10https://gerrit.wikimedia.org/r/692877 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [11:51:41] (03CR) 10Kormat: [C: 03+2] Prepare for 0.7 release. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/692878 (owner: 10Kormat) [11:51:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692871 (owner: 10Volans) [11:54:15] (03Merged) 10jenkins-bot: Prepare for 0.7 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/692878 (owner: 10Kormat) [11:55:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692872 (owner: 10Volans) [11:56:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692873 (owner: 10Volans) [11:56:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 (owner: 10Volans) [11:56:39] thanks! [11:57:11] :) np [11:57:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, PCC also fine: https://puppet-compiler.wmflabs.org/compiler1003/29623/" [puppet] - 10https://gerrit.wikimedia.org/r/692877 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [11:59:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692871 (owner: 10Volans) [12:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1200) [12:00:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692872 (owner: 10Volans) [12:03:13] (03CR) 10Volans: [C: 03+2] tests: drop Python2 support code from cli tests [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692871 (owner: 10Volans) [12:03:18] (03CR) 10Volans: [C: 03+2] cli: drop support for Python 3.4 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692872 (owner: 10Volans) [12:03:26] (03CR) 10Volans: [C: 03+2] setup.py: add explicit support for Python 3.9 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692873 (owner: 10Volans) [12:03:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/692873 (owner: 10Volans) [12:04:10] (03CR) 10Muehlenhoff: "> Patch Set 1:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 (owner: 10Volans) [12:04:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 (owner: 10Volans) [12:05:05] (03CR) 10Volans: [C: 03+2] cli: bump version to 0.3 to follow the server side [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 (owner: 10Volans) [12:06:26] (03Merged) 10jenkins-bot: tests: drop Python2 support code from cli tests [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692871 (owner: 10Volans) [12:06:27] (03Merged) 10jenkins-bot: cli: drop support for Python 3.4 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692872 (owner: 10Volans) [12:06:30] (03Merged) 10jenkins-bot: setup.py: add explicit support for Python 3.9 [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692873 (owner: 10Volans) [12:07:13] (03Merged) 10jenkins-bot: cli: bump version to 0.3 to follow the server side [software/debmonitor] - 10https://gerrit.wikimedia.org/r/692874 (owner: 10Volans) [12:14:28] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:19:09] (03CR) 10Jbond: "Took fist pass see inline" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/692869 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:20:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:21:00] (03CR) 10Jbond: [C: 03+2] chrony: drop chrony [puppet] - 10https://gerrit.wikimedia.org/r/692877 (https://phabricator.wikimedia.org/T280801) (owner: 10Jbond) [12:22:42] (03PS4) 10MSantos: maps: DB performance improvements [puppet] - 10https://gerrit.wikimedia.org/r/685743 [12:25:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16095 and previous config saved to /var/cache/conftool/dbconfig/20210519-122501-root.json [12:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:26] (03PS1) 10Filippo Giunchedi: pontoon: symlink /var/lib/puppet/client if needed [puppet] - 10https://gerrit.wikimedia.org/r/692879 [12:37:30] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/692866 (https://phabricator.wikimedia.org/T280801) (owner: 10Filippo Giunchedi) [12:37:40] (03Abandoned) 10Filippo Giunchedi: openstack: add bullseye VM clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/692866 (https://phabricator.wikimedia.org/T280801) (owner: 10Filippo Giunchedi) [12:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16096 and previous config saved to /var/cache/conftool/dbconfig/20210519-124004-root.json [12:40:06] (03PS1) 10Filippo Giunchedi: pontoon: change eqiad1::version to victoria [puppet] - 10https://gerrit.wikimedia.org/r/692880 [12:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:55] (03CR) 10Filippo Giunchedi: "Fixed in Id79c00a5be9" [puppet] - 10https://gerrit.wikimedia.org/r/692866 (https://phabricator.wikimedia.org/T280801) (owner: 10Filippo Giunchedi) [12:42:29] (03CR) 10Filippo Giunchedi: [C: 03+2] 
pontoon: change eqiad1::version to victoria [puppet] - 10https://gerrit.wikimedia.org/r/692880 (owner: 10Filippo Giunchedi) [12:49:17] (03CR) 10Kormat: [C: 03+1] pontoon: symlink /var/lib/puppet/client if needed [puppet] - 10https://gerrit.wikimedia.org/r/692879 (owner: 10Filippo Giunchedi) [12:49:55] (03PS1) 10JMeybohm: docker_kubernetes_user_password is in common for staging [labs/private] - 10https://gerrit.wikimedia.org/r/692881 (https://phabricator.wikimedia.org/T273521) [12:50:31] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] docker_kubernetes_user_password is in common for staging [labs/private] - 10https://gerrit.wikimedia.org/r/692881 (https://phabricator.wikimedia.org/T273521) (owner: 10JMeybohm) [12:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16097 and previous config saved to /var/cache/conftool/dbconfig/20210519-125508-root.json [12:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:58] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: symlink /var/lib/puppet/client if needed [puppet] - 10https://gerrit.wikimedia.org/r/692879 (owner: 10Filippo Giunchedi) [13:00:05] hashar and dancy: #bothumor I ♥ Unicode. All rise for MediaWiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1300). 
[13:10:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Repool db1175', diff saved to https://phabricator.wikimedia.org/P16098 and previous config saved to /var/cache/conftool/dbconfig/20210519-131012-root.json [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1157', diff saved to https://phabricator.wikimedia.org/P16099 and previous config saved to /var/cache/conftool/dbconfig/20210519-131920-marostegui.json [13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:08] oh [13:25:10] train time! [13:25:40] (03PS1) 10Hashar: group1 wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692891 [13:25:42] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692891 (owner: 10Hashar) [13:27:02] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692891 (owner: 10Hashar) [13:28:01] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) It's down again. Whenever we prod it it seems to recover briefly and then fall off the network after an hour or two. 
[13:28:31] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.6 [13:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:37] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.6 (duration: 01m 05s) [13:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:36] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:12] !log uploaded wmfmariadb 0.7 packages to apt [13:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:38] (03PS2) 10Herron: install_server: add buster-rescue tftp config [puppet] - 10https://gerrit.wikimedia.org/r/692634 (https://phabricator.wikimedia.org/T282575) [13:39:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: correctly handle added deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/692870 (owner: 10Giuseppe Lavagetto) [13:40:18] (03CR) 10Herron: [C: 03+2] install_server: add buster-rescue tftp config [puppet] - 10https://gerrit.wikimedia.org/r/692634 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [13:41:55] (03Merged) 10jenkins-bot: Rakefile: correctly handle added deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/692870 (owner: 10Giuseppe Lavagetto) [13:43:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:30] (03CR) 10Ottomata: "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/692626 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [13:48:59] (03CR) 10Ottomata: [V: 03+1] kafka - Use hardened_tls instead of java::security if $ssl_enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [14:00:39] (03CR) 10Urbanecm: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 1) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690680 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [14:09:12] (03CR) 10RLazarus: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/692643 (owner: 10RLazarus) [14:19:32] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10Vgutierrez) [14:19:47] 10SRE, 10Acme-chief, 10Traffic: Let's Encrypt chain size increase - https://phabricator.wikimedia.org/T283061 (10Vgutierrez) [14:19:49] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10Vgutierrez) [14:23:39] (03PS1) 10Elukey: Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) [14:23:43] 7~\o/ [14:23:46] without the 7 [14:26:58] (03CR) 10RLazarus: [C: 03+2] "(Sorry Riccardo, hit send too early.) Thanks all for the review." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692643 (owner: 10RLazarus) [14:28:16] rzl: no prob :) I've also added separately some additional tests to puppet so that it checks some validty of stuff including that one [14:28:34] yeah moritzm pointed me at it, looks good [14:30:09] 10SRE, 10Traffic: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10Vgutierrez) [14:30:34] jouncebot: now [14:30:34] For the next 0 hour(s) and 29 minute(s): MediaWiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1300) [14:30:36] jouncebot: next [14:30:36] In 3 hour(s) and 29 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1800) [14:30:36] In 3 hour(s) and 29 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1800) [14:30:46] volans: we were actually checking for "expiry_date iff expiry_contact" in cross-validate-accounts already, but it makes more sense to do it in CI -- I wonder if we should take it out of there, or just leave it in both places [14:30:55] probably just leave it in [14:31:37] I didn't check that in the script itself, I just thought was a good addition while at the date one :) [14:31:43] yeah for sure [14:31:45] either works for me [14:31:49] no strong opinion [14:42:54] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10Vgutierrez) [14:44:21] 10SRE, 10Okapi [Wikimedia Enterprise], 10Platform Engineering: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10RBrounley_WMF) Lovely, thank you @Ottomata. @nskaggs is this potentially interesting to you all? tagging for line of sight.... 
[14:48:21] only one log error so far as far as I can tell https://phabricator.wikimedia.org/T283167 [14:48:31] 10SRE, 10Okapi [Wikimedia Enterprise], 10Platform Engineering: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10Ottomata) > Are these steps needed together or different approaches? Needed together > Does this imply the need to provisi... [14:48:43] I will let the wikis as is and take a break. I am reachable by phone if need be [14:51:13] 10SRE, 10Okapi [Wikimedia Enterprise], 10Platform Engineering: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10Ottomata) Actually, the networking bits are new too, so I'm not sure how hard that would be. We'd have top configure Kafka... [15:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16100 and previous config saved to /var/cache/conftool/dbconfig/20210519-150257-root.json [15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:50] (03PS1) 10Marostegui: db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692900 [15:05:43] (03CR) 10Marostegui: [C: 03+2] db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/692900 (owner: 10Marostegui) [15:09:28] 10SRE, 10ops-eqiad, 10Data-Services, 10decommission-hardware: decommission labsdb1010.eqiad.wmnet - https://phabricator.wikimedia.org/T282523 (10wiki_willy) a:05wiki_willy→03Cmjohnson [15:18:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16101 and previous config saved to /var/cache/conftool/dbconfig/20210519-151800-root.json [15:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:41] 10SRE, 10Okapi [Wikimedia Enterprise], 
10Platform Engineering, 10Traffic: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10RBrounley_WMF) [15:22:23] 10Puppet, 10observability: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (10fgiunchedi) I tend to agree that puppet failures on alert hosts are more critical than others, implementation wise I think we could tweak the `check_puppet_run` thresholds on alert hosts to... [15:24:08] (03CR) 10Elukey: [C: 03+1] kafka - Use hardened_tls instead of java::security if $ssl_enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [15:27:00] (03PS1) 10Volans: Upstream release v0.3.0 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/692902 [15:33:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16102 and previous config saved to /var/cache/conftool/dbconfig/20210519-153304-root.json [15:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Repool db1157', diff saved to https://phabricator.wikimedia.org/P16103 and previous config saved to /var/cache/conftool/dbconfig/20210519-154808-root.json [15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:55:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:56:57] (03CR) 10CDanis: sre.network.cf: Provide some advice in the event of errors (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis) [15:57:58] (03PS1) 10Mforns: reportupdater: Rsync logs to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/692909 (https://phabricator.wikimedia.org/T274880) [15:58:24] (03PS1) 10Alexandros Kosiaris: registry: Fix up [puppet] - 10https://gerrit.wikimedia.org/r/692910 (https://phabricator.wikimedia.org/T256762) [15:58:26] (03CR) 10jerkins-bot: [V: 04-1] reportupdater: Rsync logs to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/692909 (https://phabricator.wikimedia.org/T274880) (owner: 10Mforns) [15:59:30] (03PS2) 10Alexandros Kosiaris: registry: Add proxy_pass to the catalog location block [puppet] - 10https://gerrit.wikimedia.org/r/692910 (https://phabricator.wikimedia.org/T256762) [16:01:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] registry: Add proxy_pass to the catalog location block [puppet] - 10https://gerrit.wikimedia.org/r/692910 (https://phabricator.wikimedia.org/T256762) (owner: 10Alexandros Kosiaris) [16:03:48] (03CR) 10Volans: [C: 03+2] Upstream release v0.3.0 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/692902 (owner: 10Volans) [16:06:00] (03Merged) 10jenkins-bot: Upstream release v0.3.0 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/692902 (owner: 10Volans) [16:09:23] (03CR) 10Volans: "reply to questions inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis) [16:13:49] !log uploaded debmonitor-client_0.3.0 to apt.wikimedia.org stretch-wikimedia,buster-wikimedia,bullseye-wikimedia [16:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:47] (03PS1) 10Joal: Add is_from_public_cloud to webrequest turnilo config [puppet] - 
10https://gerrit.wikimedia.org/r/692926 (https://phabricator.wikimedia.org/T279380) [16:27:12] razzi, ottomata - can one of you take care of that one please? --^ [16:27:34] 10SRE, 10Okapi [Wikimedia Enterprise], 10Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10Eugene.chernov) @BBlack, Public Hosted Zone wikimedia.com has been created on the AWS side. Now there is an ability to proceed with interception of... [16:28:06] joal: lgtm [16:28:12] (03CR) 10Razzi: [C: 03+2] Add is_from_public_cloud to webrequest turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/692926 (https://phabricator.wikimedia.org/T279380) (owner: 10Joal) [16:28:48] razzi: this then needs a turnilo restart (after puppet has run and the config has changed) [16:28:51] please :) [16:29:01] joal: sounds good, I'm on it, ops on ops week! [16:29:10] Thanks :) [16:30:20] 10SRE, 10Wikimedia-Mailing-lists: I lost the password for the Wikisource-l mailing list - https://phabricator.wikimedia.org/T282997 (10Ladsgroup) Is it done? Can we close it? [16:33:47] 10SRE, 10Okapi [Wikimedia Enterprise], 10Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10BBlack) >>! In T281428#7098765, @Eugene.chernov wrote: > [...] now we're ready to check/peek of what we have at the AWS Route53. Awesome, I'll hav... [16:34:18] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10Volans) With the current automation and logic in the generation script, adding just the IP would create this file, that is far from ideal: ` diff --git a/o... 
[16:34:42] cdanis: Hi - You have "is from public cloud" field in webrequest turnilo starting yesterday :) [16:35:28] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10Volans) I've updated https://netbox.wikimedia.org/ipam/ip-addresses/8539/ to set the DNS as manual for now so that it doesn't gets auto-generated. [16:35:30] 10SRE, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) [16:37:32] 10SRE, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (10JAllemandou) 05Open→03Resolved The new field is in turnilo with data starting from May 18th 2021. https://w.wi... [16:44:55] (03PS1) 10Ahmon Dancy: Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692929 [16:44:57] (03CR) 10Ahmon Dancy: [C: 03+2] Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692929 (owner: 10Ahmon Dancy) [16:45:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:45:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:46:10] (03Merged) 10jenkins-bot: Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/692929 (owner: 10Ahmon Dancy) [16:50:18] joal: awesome, thanks! 
[16:51:31] PROBLEM - Check for expired certificates debmonitor on pki1001 is CRITICAL: CRITICAL - 24 certs expiry in 9 days, 1 certs expiry in 5 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [16:51:52] jbond42: expected? ^^^ [16:52:00] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Lumen Ticket #: 21284533 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:52:00] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Lumen Ticket #: 21284533 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:52:48] volans: no but can wait till tomorrow, will ackknowlage [16:52:49] PROBLEM - Check for expired certificates debmonitor on pki2001 is CRITICAL: CRITICAL - 23 certs expiry in 9 days, 2 certs expiry in 5 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [16:52:55] ack, thx [16:52:59] this is just the other server ^ [16:53:03] lmk if I can help [16:53:41] volans: suspect its probably just an issue with the check logic, or something not taken into account like decomisioning serveres etc [16:53:49] but will do thanks [16:54:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:55:27] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:57:25] 10SRE, 10CFSSL-PKI: Investigate Check for expired certificates debmonitor - https://phabricator.wikimedia.org/T283185 (10jbond) p:05Triage→03High [16:58:05] ACKNOWLEDGEMENT - Check for expired certificates 
debmonitor on pki1001 is CRITICAL: CRITICAL - 20 certs expiry in 9 days, 5 certs expiry in 5 days John Bond T283185 https://wikitech.wikimedia.org/wiki/PKI/Debugging [16:58:05] ACKNOWLEDGEMENT - Check for expired certificates debmonitor on pki2001 is CRITICAL: CRITICAL - 19 certs expiry in 9 days, 6 certs expiry in 5 days John Bond T283185 https://wikitech.wikimedia.org/wiki/PKI/Debugging [16:59:21] (03PS1) 10Volans: Release v0.3.0 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/692932 [17:06:17] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Glrx) @JoKalliauer Thanks for showing that hyphens do not work, that 2.50 shows the default when there is no match, and that simply substituting an underscore for a hyphen doe... [17:37:05] 10SRE, 10SRE-Access-Requests: Superset Access for Cooltey Feng - https://phabricator.wikimedia.org/T283189 (10cooltey) [17:38:45] 10SRE, 10SRE-Access-Requests: Superset Access for Cooltey Feng - https://phabricator.wikimedia.org/T283189 (10cooltey) [17:42:09] 10SRE, 10SRE-Access-Requests: Superset Access for Cooltey Feng - https://phabricator.wikimedia.org/T283189 (10MattCleinman) Approved! [17:42:45] 10SRE, 10Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (10Olauro) Hi @Legoktm, Thank you for looking into this! I was able to create an account but it looks like my role is "member" so I think I do need to be... [17:46:13] 10SRE, 10Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (10Legoktm) >>! In T283103#7098998, @Olauro wrote: > I was able to create an account but it looks like my role is "member" so I think I do need to be added... 
[17:48:28] 10SRE, 10Traffic, 10observability, 10User-fgiunchedi: Port traffic/netops grafana alerts to AM - https://phabricator.wikimedia.org/T282806 (10lmata) moving to radar for tracking :-) [17:49:13] (03PS1) 10Ppchelko: ratelimiter: update to new upstream version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692941 (https://phabricator.wikimedia.org/T246278) [17:57:54] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10JoKalliauer) FYI: related librsvg-issues: - [[ https://gitlab.gnome.org/GNOME/librsvg/-/issues/735 | librsvg#735 RFC: meta-issue for localized SVGs ]] make it easy for Wikimed... [17:59:07] (03PS1) 10Ppchelko: Api-gateway: enable x-ratelimit-* headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/692946 (https://phabricator.wikimedia.org/T246278) [18:00:04] hashar and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:04:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:15:18] so hmm I am doing a rollback of train from group 1 due to some breakages related to actor / title etc [18:17:41] (03PS1) 10Hashar: Revert "group1 wikis to 1.37.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692951 (https://phabricator.wikimedia.org/T281147) [18:17:48] !log herron@cumin1001 START - Cookbook sre.dns.netbox [18:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:00] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.37.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692951 (https://phabricator.wikimedia.org/T281147) (owner: 10Hashar) [18:18:55] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.37.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692951 (https://phabricator.wikimedia.org/T281147) (owner: 10Hashar) [18:20:48] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.37.0-wmf.6 T281147 [18:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:52] T281147: 1.37.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T281147 [18:23:19] 10SRE, 10netops: routinator: create gabage collection job - https://phabricator.wikimedia.org/T282469 (10ayounsi) [18:23:46] !log herron@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:23:48] !log herron@cumin1001 START - Cookbook sre.dns.netbox [18:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:12] 10SRE, 10netops: routinator: create gabage collection job - https://phabricator.wikimedia.org/T282469 (10ayounsi) @jbond would the upcoming changes in 
https://github.com/NLnetLabs/routinator/releases/tag/v0.9.0-rc1 solve that issue by using a database instead of the file-system? [18:30:12] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [18:36:13] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:37:47] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org [18:42:47] (Traffic bill over quota) resolved: (2) Traffic bill over quota - https://alerts.wikimedia.org [18:57:18] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) Ok, just found out that I use sudo sgdisk in rebuilding mdadm raid arrays (I use it to copy parition info before invoking mdadm commands) Please add this to the list of cumin support c... [18:58:09] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [19:00:05] hashar and dancy: Dear deployers, time to do the MediaWiki train - European+American Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1900). [19:00:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RobH) Ok, the defective SSD has been swapped out with a new one, but the software raid doesn't appear to auto rebuild, so I invoked it... 
[19:05:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:06:34] (03PS1) 10Ahmon Dancy: Review access change [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/692660 [19:09:23] (03PS2) 10Ahmon Dancy: Allow TrainBranchBot to create refs in operations/mediawiki-config [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/692660 [19:11:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:08] 10SRE, 10Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (10Olauro) @Legoktm I reached out to Bryan and Carie and received the following response (Picture attached) : {F34460324} It seems that no one was aware... [19:13:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:20:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:09] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:25:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:29:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:40:39] !log razzi@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1125.eqiad.wmnet [19:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:44:35] jouncebot: noq [19:44:37] jouncebot: now [19:44:37] For the next 1 hour(s) and 15 minute(s): MediaWiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1900) [19:46:22] (03PS2) 10Ottomata: kafka - Use hardened_tls instead of java::security if $ssl_enabled [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) [19:46:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:50:03] (03CR) 10Ottomata: [C: 03+2] kafka - Use hardened_tls instead of java::security if $ssl_enabled [puppet] - 10https://gerrit.wikimedia.org/r/692734 (https://phabricator.wikimedia.org/T282454) (owner: 10Ottomata) [19:56:34] (03PS1) 10Ottomata: Revert "kafka - Use hardened_tls instead of java::security if $ssl_enabled" [puppet] - 10https://gerrit.wikimedia.org/r/692661 [19:56:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:58:00] (03CR) 10jerkins-bot: [V: 04-1] Revert "kafka - Use hardened_tls instead of java::security if $ssl_enabled" [puppet] - 10https://gerrit.wikimedia.org/r/692661 (owner: 10Ottomata) [19:58:06] (03CR) 10Ottomata: "We can unrevert this after https://phabricator.wikimedia.org/T279342 is done" [puppet] - 10https://gerrit.wikimedia.org/r/692661 (owner: 10Ottomata) [19:59:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:59:33] (03PS2) 10Ottomata: Revert "kafka - Use hardened_tls instead of java::security [puppet] - 10https://gerrit.wikimedia.org/r/692661 [20:00:05] hashar and dancy: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T1900). [20:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T2000). Please do the needful. 
[20:00:17] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:01:26] (03CR) 10Ottomata: [C: 03+2] Revert "kafka - Use hardened_tls instead of java::security [puppet] - 10https://gerrit.wikimedia.org/r/692661 (owner: 10Ottomata) [20:01:29] 10SRE, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (10Ottomata) Had to revert Kafka change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/692661 Can un-revert after {T279342} is done. [20:08:54] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1125.eqiad.wmnet [20:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:30:36] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:32:24] 10SRE, 10Wikimedia-Mailing-lists: Cannot manually add users to wmfreqs list due to regex banning any addresses - https://phabricator.wikimedia.org/T283103 (10Legoktm) >>! In T283103#7099246, @Olauro wrote: > I reached out to Bryan and Carie and received the following response (Picture attached) : {F34460324} >... 
[20:36:04] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:36:43] (03PS3) 10Andrew Bogott: Install systemd-timesyncd on Bullseye and later [puppet] - 10https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) [20:37:38] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:43] (03CR) 10Andrew Bogott: "This looks awfully fragile but it seems to work." [puppet] - 10https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) (owner: 10Andrew Bogott) [20:41:08] (03CR) 10Muehlenhoff: "See my comments at https://gerrit.wikimedia.org/r/c/operations/puppet/+/692852" [puppet] - 10https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) (owner: 10Andrew Bogott) [20:41:46] (03PS4) 10Andrew Bogott: Install systemd-timesyncd on Bullseye and later [puppet] - 10https://gerrit.wikimedia.org/r/691960 (https://phabricator.wikimedia.org/T280801) [20:47:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:49:24] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:50:25] 10SRE, 10Okapi [Wikimedia Enterprise], 10Platform Engineering, 10Traffic: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10nskaggs) Following. This could be an interesting reference implementation for other uses / problems. I would b... 
[20:51:38] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:03:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10wiki_willy) a:05wiki_willy→03Cmjohnson [21:22:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:23:51] (03PS1) 10Ppchelko: WIP: ratelimit: add verify stage [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692983 [21:23:59] (03PS1) 10Razzi: db1125: decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/692984 (https://phabricator.wikimedia.org/T283125) [21:24:00] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:24:22] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:24:44] (03PS2) 10Razzi: db1125: decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/692984 (https://phabricator.wikimedia.org/T283125) [21:29:59] (03PS1) 10Herron: sre.hosts.decommssion: use dd to zero the bootloader [cookbooks] - 10https://gerrit.wikimedia.org/r/692991 (https://phabricator.wikimedia.org/T283204) [21:30:09] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.decommssion: use dd to zero the bootloader [cookbooks] - 10https://gerrit.wikimedia.org/r/692991 (https://phabricator.wikimedia.org/T283204) (owner: 10Herron) [21:33:58] (03PS1) 10Herron: sre.hosts.decommssion: use dd to zero the bootloader [cookbooks] - 10https://gerrit.wikimedia.org/r/692992 (https://phabricator.wikimedia.org/T283204) [21:34:24] (03Abandoned) 10Herron: sre.hosts.decommssion: use dd to zero the bootloader [cookbooks] - 10https://gerrit.wikimedia.org/r/692991 
(https://phabricator.wikimedia.org/T283204) (owner: 10Herron) [21:36:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RobH) ` robh@wdqs2007:~$ cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] md0... [21:37:54] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:38:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:39:41] (03PS1) 10Herron: sre.hosts.decommission: clarify "wipe bootloader" step [cookbooks] - 10https://gerrit.wikimedia.org/r/692993 (https://phabricator.wikimedia.org/T283204) [21:40:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:42:18] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.decommission: clarify "wipe bootloader" step [cookbooks] - 10https://gerrit.wikimedia.org/r/692993 (https://phabricator.wikimedia.org/T283204) (owner: 10Herron) [21:42:26] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (10Dzahn) a:03Dzahn [21:44:34] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [21:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:56] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:45:46] (03PS2) 10Herron: sre.hosts.decommission: clarify "wipe bootloader" step [cookbooks] - 10https://gerrit.wikimedia.org/r/692993 (https://phabricator.wikimedia.org/T283204) 
[21:50:47] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh2001.wikimedia.org [21:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:59] !log razzi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:02] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:52:04] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (10Dzahn) Ready to create Ganeti VM doh2001.wikimedia.org in the ganeti01.svc.codfw.wmnet cluster on row A with 2 vCPUs, 8GB of RAM, 30GB of disk in the public network. Ready to create Ga... [21:52:19] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh2002.wikimedia.org [21:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:56:16] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh2002.wikimedia.org [21:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:29] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh2002.wikimedia.org [21:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:58:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host 
doh2002.wikimedia.org [21:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:21] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh2002.wikimedia.org [22:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:28] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:04:20] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh2002.wikimedia.org [22:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:06] note to self: why'd you try to create 2 VMs at the same time again.. that messes with netbox data [22:05:17] cleaning up IPs in netbox [22:06:07] (03PS1) 10Ppchelko: Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 [22:06:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:07:40] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh2002.wikimedia.org [22:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:44] (03CR) 10jerkins-bot: [V: 04-1] Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [22:08:20] jouncebot: now [22:08:21] No deployments scheduled for the next 0 hour(s) and 51 minute(s) [22:08:24] jouncebot: next [22:08:24] In 0 hour(s) and 51 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T2300) [22:08:33] * Urbanecm goes to update interwiki cache (again) [22:09:23] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692997 [22:09:25] (03CR) 10Urbanecm: [C: 
03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692997 (owner: 10Urbanecm) [22:09:27] (03PS2) 10Ppchelko: ratelimit: add verify stage [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692983 [22:09:27] !log urbanecm@deploy1002 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 11s) [22:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:50] (03Abandoned) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692997 (owner: 10Urbanecm) [22:10:16] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692998 [22:10:18] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692998 (owner: 10Urbanecm) [22:11:05] (03PS2) 10Ppchelko: Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 [22:11:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh2001.wikimedia.org [22:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:11] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692998 (owner: 10Urbanecm) [22:12:22] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 14s) [22:12:23] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (10Dzahn) @ssingh Please add "doh" to the list of clusters on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [22:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:30] (03CR) 10Ppchelko: "So a bit of context:" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [22:17:28] (03CR) 10Ppchelko: "And a disclaimer - I have 
not a lot of idea what exactly I'm doing, so there's probably a much nicer way of achieving the goal" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [22:18:12] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:18:40] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [22:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:40] !log razzi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:14] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [22:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:41] !log Start server-side upload for 3 video file (T283102, T283054) [22:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:46] T283102: Server side upload for Psubhashish - https://phabricator.wikimedia.org/T283102 [22:22:47] T283054: Server side upload for Butko - https://phabricator.wikimedia.org/T283054 [22:23:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:23:48] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (10ssingh) >>! In T283192#7099503, @Dzahn wrote: > @ssingh Please add "doh" to the list of clusters on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions Thank you for t... 
[22:24:09] (03PS1) 10Ssingh: site: add doh200[12] to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/693003 (https://phabricator.wikimedia.org/T283192) [22:24:56] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:25:51] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:21] (03CR) 10Dzahn: [C: 03+2] "thanks, was just uploading the same thing but was slow :)" [puppet] - 10https://gerrit.wikimedia.org/r/693003 (https://phabricator.wikimedia.org/T283192) (owner: 10Ssingh) [22:27:34] !log Start server-side upload for 1 video file (T283186) [22:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:37] T283186: Server side upload for Sturm - https://phabricator.wikimedia.org/T283186 [22:27:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh2002.wikimedia.org [22:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:41:06] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:44:26] !log [urbanecm@mwmaint1002 ~/uploads]$ sleep 3600 && mwscript importImages.php --wiki=commonswiki --comment-ext=txt --sleep=7200 --user=Lusccasdeutsch . 
# T278856 # 3 video files [22:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:30] T278856: Server side upload for Lusccasdeutsch (master task) - https://phabricator.wikimedia.org/T278856 [22:45:01] (03PS1) 10Dzahn: DHCP: add MAC addresses for doh2001 and doh2002 [puppet] - 10https://gerrit.wikimedia.org/r/693007 (https://phabricator.wikimedia.org/T283192) [22:45:44] (03PS1) 10Clarakosi: api-gateway: Replace echoapi with http-https-echo [deployment-charts] - 10https://gerrit.wikimedia.org/r/693008 (https://phabricator.wikimedia.org/T261367) [22:48:05] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC addresses for doh2001 and doh2002 [puppet] - 10https://gerrit.wikimedia.org/r/693007 (https://phabricator.wikimedia.org/T283192) (owner: 10Dzahn) [22:48:10] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:48:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:53:36] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:58:24] (03CR) 10Clarakosi: api-gateway: Make use of host_rewrite_path_regex option (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692717 (owner: 10Ppchelko) [22:59:12] (03CR) 10Ppchelko: [C: 03+1] "I like this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/693008 (https://phabricator.wikimedia.org/T261367) (owner: 10Clarakosi) [23:00:02] (03CR) 10Ppchelko: api-gateway: Make use of host_rewrite_path_regex option (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692717 (owner: 10Ppchelko) [23:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the 
moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210519T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:01:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:26] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:06:33] (03PS2) 10Ppchelko: api-gateway: Make use of host_rewrite_path_regex option [deployment-charts] - 10https://gerrit.wikimedia.org/r/692717 [23:08:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:14:42] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:18:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:06] (03PS1) 10Ppchelko: api-gateway: use virtual clusters to split r ans w traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/693017 [23:18:29] (03CR) 10Ppchelko: "untested!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/693017 (owner: 10Ppchelko) [23:20:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:23:24] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283192 (10Dzahn) added to DHCP but when i started OS install I noticed we need to add it to partman still. will be back in a little while [23:26:54] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:31:27] (03PS2) 10Krinkle: api-gateway: use virtual clusters to split r and w traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/693017 (owner: 10Ppchelko) [23:43:52] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:47:12] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:47:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [23:49:38] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:53:36] PROBLEM - SSH on ms-be2035 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [23:56:35] (03PS3) 10Razzi: db1125: decommission db1125 [puppet] - 10https://gerrit.wikimedia.org/r/692984 (https://phabricator.wikimedia.org/T283125) [23:56:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [23:58:34] RECOVERY - SSH on ms-be2035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring