[00:00:27] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:47] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 73648664 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:22:23] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 510712 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:18:11] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:31] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:33:07] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:50:21] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 6.451 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:57:47] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:54:49] (CR) ArielGlenn: [C: +1] "This looks ready to go pending verification of MAILTO functionality."
[puppet] - https://gerrit.wikimedia.org/r/682010 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[04:55:06] (CR) ArielGlenn: [C: +1] "This looks ready to go pending verification of MAILTO functionality." [puppet] - https://gerrit.wikimedia.org/r/682011 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[04:55:14] (CR) ArielGlenn: [C: +1] "This looks ready to go pending verification of MAILTO functionality." [puppet] - https://gerrit.wikimedia.org/r/682012 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[04:58:57] (CR) ArielGlenn: [C: +2] Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[04:59:12] (PS1) KartikMistry: Update cxserver to 2021-04-21-044024-production [deployment-charts] - https://gerrit.wikimedia.org/r/682032 (https://phabricator.wikimedia.org/T279045)
[05:11:49] (PS2) KartikMistry: Enable ContentTranslation as a default tool for 11 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/681077 (https://phabricator.wikimedia.org/T279422)
[05:39:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 to clone db1158 T258361', diff saved to https://phabricator.wikimedia.org/P15506 and previous config saved to /var/cache/conftool/dbconfig/20210423-053907-marostegui.json
[05:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:19] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[05:40:07] !log Stop db1079 to clone db1158 (lag will appear on s7 on wiki replicas)
[05:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:20] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) -
https://phabricator.wikimedia.org/T258361 (Marostegui) Transfer from db1079 to db1158 started.
[06:00:38] (PS1) Marostegui: mariadb: Productionize db1124 into s7. [puppet] - https://gerrit.wikimedia.org/r/682035 (https://phabricator.wikimedia.org/T258361)
[06:03:04] (CR) Marostegui: [C: +2] mariadb: Productionize db1124 into s7. [puppet] - https://gerrit.wikimedia.org/r/682035 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:04:56] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1124.eqiad.wmnet'] ` The log ca...
[06:07:04] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1124.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1124.eqiad.wmnet'] `
[06:10:32] SRE, DBA, Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (Marostegui) @Legoktm we can do it on Monday if you like just ping on #wikimedia-databases whenever you want to start and we can coordinate.
[06:10:55] SRE, DBA, Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (Marostegui) I meant wikimedia-databases irc channel, not the tag :)
[06:11:52] (CR) Marostegui: [C: +1] mariadb: Allow lists1001.wikimedia.org to talk to m5 [puppet] - https://gerrit.wikimedia.org/r/681753 (https://phabricator.wikimedia.org/T278614) (owner: Legoktm)
[06:17:33] (CR) Ayounsi: [C: +1] Enable per flow ECMP for kubernetes/kubestage [homer/public] - https://gerrit.wikimedia.org/r/681789 (https://phabricator.wikimedia.org/T238909) (owner: Alexandros Kosiaris)
[06:36:10] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ayounsi) I disabled the two switch ports: https://netbox.wikimedia.org/extras/changelog/56363/ https://netbox.wikimedia.org/extr...
[06:46:53] (CR) Gilles: [C: +1] cache: do not serve webp files to Safari [puppet] - https://gerrit.wikimedia.org/r/681685 (https://phabricator.wikimedia.org/T280439) (owner: Ema)
[06:47:19] SRE, MediaWiki-General, Traffic, Browser-Support-Apple-Safari, Patch-For-Review: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (Gilles) This started happening because Safari 14 is supposed to support...
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210423T0700)
[07:14:23] (CR) Ema: [C: +2] cache: do not serve webp files to Safari [puppet] - https://gerrit.wikimedia.org/r/681685 (https://phabricator.wikimedia.org/T280439) (owner: Ema)
[07:21:40] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Legoktm) wikimediacz-l and wikimediabr-l have volunteered to go first/early (thank you!). Question: do we need to migrate lists that were renamed? e.g. there's a disabled "tool-...
[07:22:18] !log removing junk bounced email addresses from yahoo from all mailing lists
[07:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:29] (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [homer/public] - https://gerrit.wikimedia.org/r/681789 (https://phabricator.wikimedia.org/T238909) (owner: Alexandros Kosiaris)
[07:27:39] (Merged) jenkins-bot: Enable per flow ECMP for kubernetes/kubestage [homer/public] - https://gerrit.wikimedia.org/r/681789 (https://phabricator.wikimedia.org/T238909) (owner: Alexandros Kosiaris)
[07:42:54] SRE, Prod-Kubernetes, Pybal, Traffic, and 2 others: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (JMeybohm) Very cool! >>! In T238909#7023429, @akosiaris wrote: > [] Look into switching to `"externalTrafficPolicy":"Local"` in...
[07:44:43] SRE, Prod-Kubernetes, Pybal, Traffic, and 2 others: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (akosiaris) >>! In T238909#7028764, @JMeybohm wrote: > Very cool! > >>>! In T238909#7023429, @akosiaris wrote: >> [] Look into sw...
[07:48:46] PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:56:22] !log deleting db1156 s2 database and reloading it from logical backups T280492
[07:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:32] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492
[07:56:51] (CR) Gehel: [C: -1] "A few issues inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[07:59:00] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1019.eqiad.wmnet
[07:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:24] RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:11:10] ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: Failed: 2I:2:3 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T280961 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:11:17] SRE, ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T280961 (ops-monitoring-bot)
[08:12:02] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be1019.eqiad.wmnet
[08:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:35] !log upgrading d-i image for bullseye to RC1 release
[08:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:52] !log filippo@cumin1001
START - Cookbook sre.hosts.reboot-single for host ms-be1021.eqiad.wmnet
[08:12:55] !log upgrading d-i image for bullseye to RC1 release T275873
[08:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:07] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873
[08:14:30] SRE, ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T280961 (fgiunchedi) Host will be ready for decom next week and filesystems are mostly empty already, no need to replace disks. Leaving the task open until decom
[08:19:29] SRE, ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T280961 (fgiunchedi)
[08:19:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1021.eqiad.wmnet
[08:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1020.eqiad.wmnet
[08:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:49] hnowlan: hi, I see you created deployment-eventlog08 but didn't sign it's Puppet certs yet, want me to take care of those?
[08:27:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1020.eqiad.wmnet
[08:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:06] (CR) David Caro: "> Patch Set 4:" [software/spicerack] - https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[08:32:08] (CR) David Caro: [C: +2] icinga: use a bash command wrapper to allow sudo [software/spicerack] - https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[08:33:15] (PS1) Marostegui: mariadb: Productionize db1158 [puppet] - https://gerrit.wikimedia.org/r/682094 (https://phabricator.wikimedia.org/T258361)
[08:34:00] (CR) Marostegui: [C: +2] mariadb: Productionize db1158 [puppet] - https://gerrit.wikimedia.org/r/682094 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:35:00] RECOVERY - Device not healthy -SMART- on ms-be1019 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1019&var-datasource=eqiad+prometheus/ops
[08:37:45] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1158 is now replicating. Once it's caught up I will enable GTID and start checking tables.
[08:39:32] (PS1) Elukey: profile::thanos::swift: add account for ML serve cluster [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773)
[08:39:45] (Merged) jenkins-bot: icinga: use a bash command wrapper to allow sudo [software/spicerack] - https://gerrit.wikimedia.org/r/681694 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[08:41:35] (PS3) David Caro: icinga: use a sudo-friendly command to get command_file [software/spicerack] - https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641)
[08:41:46] (PS2) Elukey: profile::thanos::swift: add account for ML serve cluster [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773)
[08:42:29] (CR) Elukey: "Suggestions about naming are very welcome, I started this change to kick off the conversation about it :)" [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773) (owner: Elukey)
[08:43:13] (PS1) David Caro: upgrade-and-reboot: add possibility to use sudo [cookbooks] - https://gerrit.wikimedia.org/r/682098 (https://phabricator.wikimedia.org/T280641)
[08:43:15] (PS1) David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/682099
[08:44:30] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1158: GTID enabled and tables being checked
[08:49:57] (PS2) David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/682099
[08:51:04] (CR) Filippo Giunchedi: "LGTM!
See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773) (owner: Elukey)
[08:52:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 25%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15509 and previous config saved to /var/cache/conftool/dbconfig/20210423-085212-root.json
[08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:45] (CR) jerkins-bot: [V: -1] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/682099 (owner: David Caro)
[08:53:10] (CR) Elukey: profile::thanos::swift: add account for ML serve cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773) (owner: Elukey)
[08:53:38] (PS3) Elukey: profile::thanos::swift: add account for ML serve cluster [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773)
[08:54:48] (CR) Filippo Giunchedi: [C: +1] "LGTM, this will need the corresponding change in private.git (both really private and the public private)" [puppet] - https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773) (owner: Elukey)
[09:01:10] (PS3) David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/682099
[09:04:34] (CR) jerkins-bot: [V: -1] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/682099 (owner: David Caro)
[09:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 50%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15510 and previous config saved to /var/cache/conftool/dbconfig/20210423-090716-root.json
[09:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:07] (CR) Elukey: [C: +2] hadoop: improve default log4j config [puppet] -
https://gerrit.wikimedia.org/r/680383 (https://phabricator.wikimedia.org/T276906) (owner: Elukey)
[09:14:23] (PS2) Muehlenhoff: Enable failoid role for failoid1002/2002 [puppet] - https://gerrit.wikimedia.org/r/681696
[09:17:07] (PS4) David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/682099
[09:19:31] (PS1) Matthias Mullie: Enable Extension:MediaSeach on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939)
[09:21:53] (CR) Matthias Mullie: [C: -2] "DNM until 2 branches after I0a021e6c3a02350493eca2197b2515dbfd1c1c88 has been merged" [mediawiki-config] - https://gerrit.wikimedia.org/r/682102 (https://phabricator.wikimedia.org/T265939) (owner: Matthias Mullie)
[09:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 75%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15511 and previous config saved to /var/cache/conftool/dbconfig/20210423-092220-root.json
[09:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:02] (CR) Muehlenhoff: [C: +1] "LGTM" [software/cumin] - https://gerrit.wikimedia.org/r/681758 (owner: Volans)
[09:25:37] (CR) Jcrespo: "This is ready to merge, but asking a question to be sure this is the logic you want." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/681763 (owner: Legoktm)
[09:26:14] (CR) Jcrespo: "> Patch Set 2:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/681763 (owner: Legoktm)
[09:27:47] (PS1) Matthias Mullie: Enable Extension:MediaSearch on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/682105 (https://phabricator.wikimedia.org/T265939)
[09:28:24] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (jcrespo) a:Papaul→jcrespo Thank you.
[09:29:15] (CR) Muehlenhoff: [C: +2] Enable failoid role for failoid1002/2002 [puppet] - https://gerrit.wikimedia.org/r/681696 (owner: Muehlenhoff)
[09:31:45] SRE, Traffic, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (jbond) >>! In T265904#6722401, @jbond wrote: >> note: I created a [[ https://tickets.puppetlabs.com/browse/FACT-2843 | bug against facter4 ]] which is related > > FYI this has been resolve...
[09:31:50] (CR) David Caro: [C: +2] aptrepo.ceph: filter out debugging packages [puppet] - https://gerrit.wikimedia.org/r/681299 (owner: David Caro)
[09:37:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15512 and previous config saved to /var/cache/conftool/dbconfig/20210423-093723-root.json
[09:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:58] (PS1) David Caro: wmcs.ceph: add cookbook to upgrade all osds [cookbooks] - https://gerrit.wikimedia.org/r/682106 (https://phabricator.wikimedia.org/T280641)
[09:43:38] (PS1) Muehlenhoff: Switch to failoid2002 [puppet] - https://gerrit.wikimedia.org/r/682107
[09:43:40] (PS1) Muehlenhoff: Switch to failoid1002 [puppet] - https://gerrit.wikimedia.org/r/682108
[09:49:03] !log installing xorg-server security updates
[09:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:29] (CR) Arturo Borrero Gonzalez: wmcs: Refactor a bit the openstack commands (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/679792 (owner: David Caro)
[09:50:38] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs: Refactor a bit the openstack commands [cookbooks] - https://gerrit.wikimedia.org/r/679792 (owner: David Caro)
[09:52:46] (CR) Effie Mouzeli: rdf-streaming-updater: create helmfile.d structure (3 comments) [deployment-charts] -
https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[09:52:52] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.ceph: add cookbook to upgrade all osds [cookbooks] - https://gerrit.wikimedia.org/r/682106 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:53:06] (CR) Jbond: [C: +1] "lgtm some minor optional nits" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:53:19] dcaro: fyi ^
[09:54:45] (PS3) Effie Mouzeli: rdf-streaming-updater: enable HA capability [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: Mstyles)
[09:54:56] !log installing xen security updates
[09:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:27] (CR) Effie Mouzeli: [C: +1] "I can't help much on the rbac stuff, I will leave that too alex and janis" [deployment-charts] - https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: Mstyles)
[10:01:13] jbond42: thanks!
[10:06:00] !log installing Linux 4.19.181 updates from Buster 10.9 point release (no reboots, just updating the packages)
[10:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:51] (PS2) David Caro: wmcs: Refactor a bit the openstack commands [cookbooks] - https://gerrit.wikimedia.org/r/679792
[10:07:54] (CR) David Caro: wmcs: Refactor a bit the openstack commands (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/679792 (owner: David Caro)
[10:08:15] (PS3) David Caro: wmcs: Refactor a bit the openstack commands [cookbooks] - https://gerrit.wikimedia.org/r/679792
[10:09:48] SRE, observability, CAS-SSO, Patch-For-Review, and 2 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (fgiunchedi) As a data point, after the forcelogin change (thanks!) I haven't experienced faulty logins/redirects when...
[10:14:19] (CR) David Caro: [C: +2] wmcs: Refactor a bit the openstack commands [cookbooks] - https://gerrit.wikimedia.org/r/679792 (owner: David Caro)
[10:16:39] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:16:56] (Merged) jenkins-bot: wmcs: Refactor a bit the openstack commands [cookbooks] - https://gerrit.wikimedia.org/r/679792 (owner: David Caro)
[10:17:35] (PS4) David Caro: icinga: use a sudo-friendly command to get command_file [software/spicerack] - https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641)
[10:20:50] anybody working on kubestatge nodes in codfw?
[10:24:13] ah I see akosiaris working on it on cr2-codfw
[10:24:15] :)
[10:24:24] (CR) David Caro: [C: +2] icinga: use a sudo-friendly command to get command_file [software/spicerack] - https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[10:24:34] yes, that's me and jayme messing with bfd
[10:24:49] perfect, just wanted to double check
[10:24:52] and I see you running 'show bfd sessions' ;-)
[10:25:37] yes I tried "status" "neighbors" "etc.."
[10:25:44] every time I have to look it up
[10:25:47] :D
[10:26:22] (CR) David Caro: [C: +2] "Merging for now, will change once we have a nicer runbook (working on where to put/how to structure)." [puppet] - https://gerrit.wikimedia.org/r/680254 (owner: David Caro)
[10:26:38] SRE, observability, CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (fgiunchedi) FWIW this is still happening (namely when GET'ing a query with an sso session in need for refresh, the thanos UI shows `Error executing query: OK`, fully refreshi...
[10:26:39] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:26:48] same thing for this one ^
[10:27:17] (CR) David Caro: [C: +2] wmcs: Add link to runbook on puppet alerts.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/680254 (owner: David Caro)
[10:27:51] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:14] (Merged) jenkins-bot: icinga: use a sudo-friendly command to get command_file [software/spicerack] - https://gerrit.wikimedia.org/r/681754 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[10:37:37] (PS9) ArielGlenn: snapshot: Migrate cronjobs in pagetitles to systemd timer [puppet] - https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:38:23] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in pagetitles to systemd timer [puppet] - https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:39:39] (PS2) ArielGlenn: snapshot: Migrate cronjobs in categoriesrdf to systemd timer [puppet] - https://gerrit.wikimedia.org/r/681357 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:39:49] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:40:24] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in categoriesrdf to systemd timer [puppet] - https://gerrit.wikimedia.org/r/681357 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:40:45] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:41:22] (PS2) ArielGlenn: snapshot: Migrate cronjobs in dump_global_blocks to systemd timer [puppet] - https://gerrit.wikimedia.org/r/681360 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:42:03] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in dump_global_blocks to systemd
timer [puppet] - https://gerrit.wikimedia.org/r/681360 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:44:14] (PS2) ArielGlenn: snapshot: Migrate cronjobs in dump_machine_vision to systemd timer [puppet] - https://gerrit.wikimedia.org/r/681361 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:45:16] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in dump_machine_vision to systemd timer [puppet] - https://gerrit.wikimedia.org/r/681361 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:46:18] (PS2) ArielGlenn: snapshot: Migrate cronjobs in shorturl to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682010 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:47:03] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in shorturl to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682010 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:47:51] (PS2) ArielGlenn: snapshot: Migrate cronjobs in contentxlation to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682011 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:48:58] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in contentxlation to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682011 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:49:53] (PS2) ArielGlenn: snapshot: Migrate cronjobs in mediaperprojectlists to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682012 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:50:37] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in mediaperprojectlists to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682012 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[10:54:28] Puppet, SRE, puppet-compiler, Patch-For-Review, User-jbond: replace all puppet crons with systemd timers -
https://phabricator.wikimedia.org/T273673 (ArielGlenn) As folks might guess from all the merges, the first email via MAILTO to ops-dumps arrived today, verifying that part of the migrati...
[10:56:07] PROBLEM - Disk space on failoid1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=72%): /tmp 0 MB (0% inode=72%): /var/tmp 0 MB (0% inode=72%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=failoid1001&var-datasource=eqiad+prometheus/ops
[11:00:05] (PS1) Jbond: debmonitor-client: add python3.4 support back [software/debmonitor] - https://gerrit.wikimedia.org/r/682112
[11:01:36] ^ fixing the old failoid nodes
[11:02:29] (PS2) Jbond: debmonitor-client: add python3.4 support back [software/debmonitor] - https://gerrit.wikimedia.org/r/682112
[11:11:07] SRE, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Ladsgroup) Today I removed around 100 email addresses from all mailing lists and still I'm getting hundreds of uncaught bounce notification with any email to listadmins@
[11:17:12] (PS1) Ladsgroup: snapshot: Remove absented crons [puppet] - https://gerrit.wikimedia.org/r/682113 (https://phabricator.wikimedia.org/T273673)
[11:17:27] RECOVERY - Disk space on failoid1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=failoid1001&var-datasource=eqiad+prometheus/ops
[11:17:38] (PS1) Jbond: wmf_auto_reimage_lib: use correct SSL certs [puppet] - https://gerrit.wikimedia.org/r/682114
[11:19:07] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/682113 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[11:26:31] (CR) Muehlenhoff: [C: +1] "Looks good" [software/debmonitor] - https://gerrit.wikimedia.org/r/682112 (owner: Jbond)
[11:44:53] (PS2)
Jbond: wmf_auto_reimage_lib: use correct SSL certs [puppet] - https://gerrit.wikimedia.org/r/682114
[11:47:16] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (akosiaris)
[11:47:41] (CR) Jbond: [C: +2] debmonitor-client: add python3.4 support back [software/debmonitor] - https://gerrit.wikimedia.org/r/682112 (owner: Jbond)
[11:50:51] !log installing perf updates from Buster 10.9 point release
[11:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:23] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (akosiaris)
[11:54:28] (PS1) Jbond: 0.2.8-3: prepare release [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/682116
[11:56:03] (PS1) Alexandros Kosiaris: Add sihe to analycs-wmde-users and analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/682118 (https://phabricator.wikimedia.org/T280541)
[11:56:51] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (akosiaris)
[11:57:23] (CR) Alexandros Kosiaris: [C: +2] Add sihe to analycs-wmde-users and analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/682118 (https://phabricator.wikimedia.org/T280541) (owner: Alexandros Kosiaris)
[11:58:33] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Wikimedia Analytics Data for Silvan Heintze - https://phabricator.wikimedia.org/T280541 (akosiaris) Open→Resolved a:akosiaris @Silvan_WMDE, Hi! The change expanding your access has been merged. Give it 30m or so to fully pro...
[11:59:58] (03CR) 10Jbond: [C: 03+1] "minor nit but lgtm" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682098 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [12:05:21] 10SRE, 10Commons, 10MediaWiki-Uploading, 10Structured Data Engineering, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (10akosiaris... [12:06:34] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) >>! In T280322#7028718, @Legoktm wrote: > wikimediacz-l and wikimediabr-l have volunteered to go first/early (thank you!). > > Question: do we need to migrate lists t... [12:12:22] (03CR) 10Jbond: [C: 03+2] 0.2.8-3: prepare release [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682116 (owner: 10Jbond) [12:13:28] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10akosiaris) Hi Aisha. There is not such thing as `All` as far as groups go. Could you please clarify what exactly you are requesting access to? [12:13:45] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10akosiaris) p:05Triage→03Medium [12:14:31] (03CR) 10Ladsgroup: [C: 03+1] lists: Backup /var/lib/mailman3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681763 (owner: 10Legoktm) [12:16:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/682114 (owner: 10Jbond) [12:16:51] (03Merged) 10jenkins-bot: 0.2.8-3: prepare release [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682116 (owner: 10Jbond) [12:19:06] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (10akosiaris) p:05Triage→03Medium [12:19:20] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10akosiaris) p:05Triage→03Medium [12:19:44] 10SRE, 10Wikimedia-Mailing-lists: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/ should redirect to postorius - https://phabricator.wikimedia.org/T280893 (10akosiaris) p:05Triage→03Medium [12:20:07] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T280961 (10akosiaris) p:05Triage→03Low [12:23:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] quotereviewer.py: Stop using "blacklist" (031 comment) [software] - 10https://gerrit.wikimedia.org/r/681736 (https://phabricator.wikimedia.org/T254646) (owner: 10Reedy) [12:24:00] (03CR) 10Jcrespo: [C: 03+1] "No need to explain, just wanted to make sure it was intended." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681763 (owner: 10Legoktm) [12:30:28] (03PS1) 10Aklapper: phabricator weekly changes email: List board column trigger changes [puppet] - 10https://gerrit.wikimedia.org/r/682120 [12:34:57] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10Gehel) I've been pointed to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Requesting_access for some explanation about access levels. My understanding is that... 
[12:39:35] 10ops-eqiad: Rack/power audit in eqiad c8/d5 - https://phabricator.wikimedia.org/T280977 (10ayounsi) p:05Triage→03High [12:43:18] (03PS1) 10Jbond: changelog: need to bump the actual version number as the source changed [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682122 [12:44:37] (03CR) 10Jbond: [C: 03+2] changelog: need to bump the actual version number as the source changed [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/682122 (owner: 10Jbond) [12:47:13] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [12:49:33] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:49:41] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [12:50:57] !log upload new debmonitor-client packages [12:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:28] (03PS1) 10Aklapper: phabricator weekly changes email: List dashboard panel changes [puppet] - 10https://gerrit.wikimedia.org/r/682124 [12:52:33] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10elukey) Plus we'll also need a kerberos principal! 
:) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user [12:52:36] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10AKhatun_WMF) [12:55:19] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10AKhatun_WMF) Thanks, I've updated it. The only thing thats left in the 'All of the above' section on [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Requesti... [12:55:28] 10SRE, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10fgiunchedi) >>! In T268233#7029062, @fgiunchedi wrote: > FWIW this is still happening (namely when GET'ing a query with an sso session in need for refresh, the thanos UI show... [12:56:28] (03CR) 10Jbond: [C: 03+2] wmf_auto_reimage_lib: use correct SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/682114 (owner: 10Jbond) [12:57:27] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Manuel) Thank you all! 
[13:11:27] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [13:12:39] (03PS1) 10Elukey: profile::thanos::swift: add fake credentials for mlserve_prod [labs/private] - 10https://gerrit.wikimedia.org/r/682125 (https://phabricator.wikimedia.org/T280773) [13:15:39] (03CR) 10Klausman: [C: 03+1] profile::thanos::swift: add account for ML serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773) (owner: 10Elukey) [13:17:22] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos::swift: add fake credentials for mlserve_prod [labs/private] - 10https://gerrit.wikimedia.org/r/682125 (https://phabricator.wikimedia.org/T280773) (owner: 10Elukey) [13:17:28] (03CR) 10Alexandros Kosiaris: "Patch is correct, a couple of comments on the commit message." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681500 (owner: 10Dzahn) [13:19:44] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::thanos::swift: add fake credentials for mlserve_prod [labs/private] - 10https://gerrit.wikimedia.org/r/682125 (https://phabricator.wikimedia.org/T280773) (owner: 10Elukey) [13:20:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [13:20:29] (03Abandoned) 10Gehel: elasticsearch: add methods to upgrade elasticsearch and plugins [software/spicerack] - 10https://gerrit.wikimedia.org/r/491254 (https://phabricator.wikimedia.org/T202885) (owner: 10Gehel) [13:22:31] (03CR) 10Elukey: [C: 03+2] profile::thanos::swift: add account for ML serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/682097 (https://phabricator.wikimedia.org/T280773) (owner: 10Elukey) [13:25:36] Grazie ema! 
https://lwn.net/Articles/852112/ [13:27:39] wow [13:28:05] (03PS1) 10Andrew Bogott: Dummy passwords for Trove service user [labs/private] - 10https://gerrit.wikimedia.org/r/682126 (https://phabricator.wikimedia.org/T212595) [13:29:50] 10SRE, 10SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10CBogen) > Name of approving party (manager for WMF/WMDE staff): @CBogen Approved! [13:30:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [13:31:17] (03PS2) 10David Caro: upgrade-and-reboot: add possibility to use sudo [cookbooks] - 10https://gerrit.wikimedia.org/r/682098 (https://phabricator.wikimedia.org/T280641) [13:31:19] (03CR) 10David Caro: upgrade-and-reboot: add possibility to use sudo (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682098 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:33:51] !log roll restart of all thanos-swift proxies to pick up new ML account - T280773 [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:02] T280773: Swift account to store ML models - https://phabricator.wikimedia.org/T280773 [13:35:46] (03CR) 10David Caro: [C: 03+2] upgrade-and-reboot: add possibility to use sudo [cookbooks] - 10https://gerrit.wikimedia.org/r/682098 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:37:02] (03PS1) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [13:37:19] Nemo_bis: <3 [13:37:31] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Dummy passwords for Trove service user [labs/private] - 10https://gerrit.wikimedia.org/r/682126 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [13:38:15] (03CR) 
10jerkins-bot: [V: 04-1] Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [13:38:43] (03Merged) 10jenkins-bot: upgrade-and-reboot: add possibility to use sudo [cookbooks] - 10https://gerrit.wikimedia.org/r/682098 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:39:48] (03PS2) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [13:42:51] (03PS3) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [13:44:39] (03PS4) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [13:49:27] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10jbond) @akosiaris thanks for digging into this a bit further, and appolagise for not leaving more then a drive by comment: > How long did the run take? My reading of the graphs says ~40m (fro... [13:49:41] (03PS5) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [13:50:16] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10JMeybohm) During testing today, we had some sideline issues because calico-node was dying (as we brought down the network inter... 
[13:50:47] (03CR) 10Muehlenhoff: C:package_builder: Add Script for building debian packages from git (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [13:56:33] (03PS1) 10Jbond: cumin: fix typo in reimage script [puppet] - 10https://gerrit.wikimedia.org/r/682132 [13:56:51] (03PS5) 10Reedy: quotereviewer.py: Stop using "blacklist" [software] - 10https://gerrit.wikimedia.org/r/681736 (https://phabricator.wikimedia.org/T254646) [13:58:25] (03PS6) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [13:58:43] (03CR) 10Jbond: [C: 03+2] cumin: fix typo in reimage script [puppet] - 10https://gerrit.wikimedia.org/r/682132 (owner: 10Jbond) [14:00:46] (03PS7) 10Andrew Bogott: Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) [14:01:30] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10Reedy) 05Open→03Declined Yeah, I think the information I need is in netbox (or via just SSH-ing to the server directly) The notice on racktables would be nice, but would agree if it's excessive wor... [14:01:57] 10SRE, 10ops-eqiad: Rack/power audit in eqiad c8/d5 - https://phabricator.wikimedia.org/T280977 (10ayounsi) @andrew @arturo, in parallel it would be useful to know how much growth is expected in WMCS (especially the servers going behind the cloudsw) for the next FY. 
[14:04:52] (03CR) 10Andrew Bogott: [C: 03+2] Trove: move most trove activity into a service project [puppet] - 10https://gerrit.wikimedia.org/r/682129 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [14:06:26] (03PS4) 10Jbond: C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 [14:06:28] (03CR) 10Jbond: "thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [14:19:05] (03PS2) 10ArielGlenn: snapshot: Remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/682113 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:20:43] (03CR) 10ArielGlenn: [C: 03+2] snapshot: Remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/682113 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:25:34] !log revert back bullseye image to daily build from last week (to rule out potential reimage issue) [14:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:42] 10SRE, 10ops-codfw: thanos-fe2001 machine check exception and crash/stall - https://phabricator.wikimedia.org/T280782 (10Papaul) @fgiunchedi it looks to me like a memory or CPU problem. I will swap A1 with B1 when i am back on site and see if we do have the same error on B1. ` Multi-bit memory errors detected... 
[14:31:58] 10SRE, 10ops-codfw: thanos-fe2001 machine check exception and crash/stall - https://phabricator.wikimedia.org/T280782 (10Papaul) a:03Papaul [14:43:39] (03CR) 10David Caro: "Just a question, everything else are nits (you can ignore them), LGTM, but I'm not familiar with the build process, so I'll leave the +1 f" (0320 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [14:59:39] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on theemin.codfw.wmnet with reason: REIMAGE [14:59:39] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on theemin.codfw.wmnet with reason: REIMAGE [14:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:40] (03PS1) 10Effie Mouzeli: modules::conftool add safe-service-restart scap option [puppet] - 10https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) [15:21:32] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) [15:21:36] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.1; 2021-04-13), and 2 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10jijiki) [15:21:38] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [15:21:52] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [15:22:22] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) 05Resolved→03Open [15:28:08] (03PS1) 10Effie Mouzeli: 
mediawiki::mcrouter_wancache: upgrade onhost memcached to 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/682166 (https://phabricator.wikimedia.org/T270315) [15:29:52] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::mcrouter_wancache: upgrade onhost memcached to 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/682166 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [15:31:21] PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:23] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: upgrade onhost memcached to 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/682166 (https://phabricator.wikimedia.org/T270315) [15:32:35] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::mcrouter_wancache: upgrade onhost memcached to 1.6 [puppet] - 10https://gerrit.wikimedia.org/r/682166 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [15:37:39] PROBLEM - Host cp1087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:43:33] RECOVERY - Host cp1087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [15:44:05] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:44:32] 10SRE, 10ops-eqiad, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10Cmjohnson) 05Open→03Resolved replaced cpu1 and cleared the idrac log, resolving, if the issue returns please re-open. 
[15:46:35] RECOVERY - Memory correctable errors -EDAC- on thumbor2001 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops [15:46:57] (03CR) 10Jbond: "updated thanks" (0320 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [15:52:02] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10Cmjohnson) 05Open→03Resolved The disk has been swapped, I am resolving this task because the on-site work has been completed. [15:57:33] 10SRE, 10ops-eqiad: Can't access thanos-fe1001.mgmt - https://phabricator.wikimedia.org/T280623 (10Cmjohnson) Password was incorrect, fixed [15:57:39] 10SRE, 10ops-eqiad: Can't access thanos-fe1001.mgmt - https://phabricator.wikimedia.org/T280623 (10Cmjohnson) 05Open→03Resolved [16:07:19] (03PS1) 10David Caro: wmcs.openstack: add cloudvirt maintenance cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) [16:08:11] (03CR) 10David Caro: wmcs.openstack: add cloudvirt maintenance cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [16:10:00] (03CR) 10David Caro: wmcs.openstack: add cloudvirt maintenance cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [16:10:32] (03CR) 10jerkins-bot: [V: 04-1] wmcs.openstack: add cloudvirt maintenance cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [16:19:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:40] jynus: is it ok to merge the mailman3 
backup patch if the path doesn't exist yet? [16:19:53] sure [16:20:07] we can create it empty if it errors out [16:20:33] but that is exactly why I asked if it should be enabled now (the logic of it) [16:21:49] RECOVERY - Device not healthy -SMART- on an-worker1100 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1100&var-datasource=eqiad+prometheus/ops [16:22:24] the problem is monitoring will complain if it is empty [16:22:39] legoktm, consider adding it to the list of ignoring monitoring backups [16:23:31] we can just merge it next week once the path exists [16:23:47] 10SRE, 10ops-eqiad: htmldumper1001 power suply failure - https://phabricator.wikimedia.org/T280618 (10Cmjohnson) 05Open→03Resolved Loose power cable [16:24:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:48] legoktm, https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/backup/job_monitoring_ignorelist [16:25:08] my suggestion is either to wait for merging or add it there so it doesn't warn about empty backups, any of the 2 [16:25:47] (03PS1) 10Dzahn: site: add planet role on planet1003, testing bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682171 [16:26:02] (03CR) 10jerkins-bot: [V: 04-1] site: add planet role on planet1003, testing bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682171 (owner: 10Dzahn) [16:27:23] (03PS2) 10Dzahn: site: add planet role on planet1003, testing bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682171 [16:32:21] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:59] (03CR) 10Legoktm: "After IRC discussion, we'll merge this once mailman3 has been installed on lists1001 and 
/var/lib/mailman3 exists." [puppet] - 10https://gerrit.wikimedia.org/r/681763 (owner: 10Legoktm) [16:35:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:20] (03PS5) 10Jbond: C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 [16:37:05] (03CR) 10Dzahn: [C: 03+2] site: add planet role on planet1003, testing bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682171 (owner: 10Dzahn) [16:40:12] (03PS1) 10Jbond: nstall_server: add sretest1002 to dhcp files [puppet] - 10https://gerrit.wikimedia.org/r/682173 [16:40:51] (03PS2) 10Jbond: install_server: add sretest1002 to dhcp files [puppet] - 10https://gerrit.wikimedia.org/r/682173 [16:41:48] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10akosiaris) >>! In T280622#7029474, @jbond wrote: > @akosiaris thanks for digging into this a bit further, and appolagise for not leaving more then a drive by comment: > >> How long did the ru... [16:49:42] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10jbond) > The result was 1000+ DNS requests per agent run. yes i also hit the same issue in $JOB~1 will try and find the relevant bugs and check on progress. As mentioned the pop caches will... 
[16:52:30] (03PS1) 10Ottomata: refine - allow configuring RefineMonitors since and until params [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) [16:54:00] (03CR) 10jerkins-bot: [V: 04-1] refine - allow configuring RefineMonitors since and until params [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [16:56:56] (03PS2) 10Ottomata: refine - allow configuring RefineMonitors since and until params [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) [16:57:05] 10SRE: try planet on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [16:57:16] 10SRE: try planet on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) p:05Triage→03Low [16:58:35] (03CR) 10jerkins-bot: [V: 04-1] refine - allow configuring RefineMonitors since and until params [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [17:00:16] 10SRE: try planet on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [x] planet1003.eqiad.wmnet created with bullseye https://gerrit.wikimedia.org/r/c/operations/puppet/+/681774 https://gerrit.wikimedia.org/r/c/operations/puppet/+/681779 [x] applied planet role https://gerrit.wikimedia.org/r/c/... [17:02:23] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) > [] Simulate node failures and record/evaluate recovery times We 've looked into this with @JMeybohm. We 've notic... 
[17:02:33] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet [17:02:36] (03PS3) 10Ottomata: refine - allow configuring RefineMonitors since and until params [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) [17:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10Cmjohnson) [17:03:13] 10SRE, 10ops-eqiad, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10elukey) To keep archives happy: repooled after a chat with Brandon :) [17:03:47] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29164/console" [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [17:03:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10Cmjohnson) a:05Cmjohnson→03RobH Assigning this to @robh to complete install [17:05:02] (03CR) 10Ottomata: [V: 03+1 C: 03+2] refine - allow configuring RefineMonitors since and until params [puppet] - 10https://gerrit.wikimedia.org/r/682175 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [17:06:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10Cmjohnson) [17:07:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10Cmjohnson) a:05Cmjohnson→03RobH Assigning to @robh for installs [17:12:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] quotereviewer.py: Stop using "blacklist" 
[software] - 10https://gerrit.wikimedia.org/r/681736 (https://phabricator.wikimedia.org/T254646) (owner: 10Reedy) [17:13:00] (03PS4) 10Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - 10https://gerrit.wikimedia.org/r/681432 [17:13:32] (03PS1) 10Dzahn: planet: use python3 tidylib and libxml2 versions on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682181 (https://phabricator.wikimedia.org/T280989) [17:14:06] (03Merged) 10jenkins-bot: quotereviewer.py: Stop using "blacklist" [software] - 10https://gerrit.wikimedia.org/r/681736 (https://phabricator.wikimedia.org/T254646) (owner: 10Reedy) [17:15:10] PROBLEM - Check no envoy runtime configuration is left persistent on planet1003 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [17:15:38] PROBLEM - Check that envoy is running on planet1003 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [17:16:30] (03PS2) 10Awight: [DNM] Revert "Temporarily disable some reportupdater jobs" [puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) [17:17:28] (03PS3) 10Awight: Revert "Temporarily disable some reportupdater jobs" [puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) [17:17:34] (03PS3) 10Majavah: redis::multidc: Make discovery optional [puppet] - 10https://gerrit.wikimedia.org/r/669447 [17:19:50] (03CR) 10Dzahn: [C: 03+2] planet: use python3 tidylib and libxml2 versions on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682181 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [17:23:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Pause while discussion is ongoing on https://phabricator.wikimedia.org/T280718" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) 
[17:24:27] 10SRE: Expose live font list (fc-list) on a public webpage - https://phabricator.wikimedia.org/T280829 (10akosiaris) >>! In T280829#7025378, @Dzahn wrote: > ACK! > > It's kind of a duplicate of T280718 before that was renamed at least. Should we merge into T280718? It looks like that is the task where discuss... [17:31:41] 10SRE: Expose live font list (fc-list) on a public webpage - https://phabricator.wikimedia.org/T280829 (10Dzahn) Yes, fine with me, we have a couple tickets semi-related. Actually updating the file is separate too, in T79424. [17:32:44] 10SRE: Expose live font list (fc-list) on a public webpage - https://phabricator.wikimedia.org/T280829 (10akosiaris) Cool, done so, thanks! [17:32:51] 10SRE: Expose live font list (fc-list) on a public webpage - https://phabricator.wikimedia.org/T280829 (10akosiaris) [17:33:17] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) [17:36:00] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/682183 [17:38:52] (03PS1) 10Esanders: Make DiscussionTool's sourcemodetoolbar available on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682184 [17:41:45] (03PS2) 10Dzahn: ci/deployment-server: adding kubernetes namespace for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/681500 [17:44:49] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Dwisehaupt) [17:54:50] (03PS1) 10Legoktm: mailman: Don't un-advertise a list when disabling it [puppet] - 10https://gerrit.wikimedia.org/r/682185 [17:55:59] (03CR) 10Dzahn: [C: 03+2] "tested query. 
it's empty result right now, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/682120 (owner: 10Aklapper) [17:56:06] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) I 've gone through the various tasks (T280829, T79424, T210960 and T180923, let me know if I have missed... [17:59:39] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10Dzahn) We could run one time on a random (canary) appserver and let it write the output to a local file. Then we c... [18:01:36] (03CR) 10Dzahn: "I can't just merge this, changing GRANTs needs DBA to deploy changes." [puppet] - 10https://gerrit.wikimedia.org/r/682124 (owner: 10Aklapper) [18:03:51] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) >>! In T280718#7030090, @Dzahn wrote: > We could run one systemd timer on a random (canary) appserver an... 
[18:13:12] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[18:15:30] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[18:24:53] (PS1) Dwisehaupt: Add new payments hosts to monitoring [puppet] - https://gerrit.wikimedia.org/r/682186 (https://phabricator.wikimedia.org/T266481)
[18:35:40] (CR) Effie Mouzeli: [C: +1] conftool: add comments about 2 dedicated videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: Dzahn)
[18:38:53] (CR) Dzahn: [C: +2] conftool: add comments about 2 dedicated videoscalers [puppet] - https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: Dzahn)
[18:48:02] (PS4) Legoktm: redis::multidc: Make discovery optional [puppet] - https://gerrit.wikimedia.org/r/669447 (owner: Majavah)
[18:49:04] (CR) Dwisehaupt: "Adding in monitoring for the new payments servers when we are ready." [puppet] - https://gerrit.wikimedia.org/r/682186 (https://phabricator.wikimedia.org/T266481) (owner: Dwisehaupt)
[18:49:06] (CR) Legoktm: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29166/console" [puppet] - https://gerrit.wikimedia.org/r/669447 (owner: Majavah)
[18:49:40] (CR) Legoktm: [V: +1 C: +2] redis::multidc: Make discovery optional [puppet] - https://gerrit.wikimedia.org/r/669447 (owner: Majavah)
[18:59:54] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Legoktm) vereinat-l asked to delay until one of their admins is available at the beginning of May.
[19:01:59] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Legoktm) >>! In T280322#7029299, @Ladsgroup wrote: >>>! In T280322#7028718, @Legoktm wrote: >> Question: do we need to migrate lists that were renamed? e.g. there's a disabled "t...
[19:02:10] SRE, fundraising-tech-ops, Patch-For-Review: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (Dwisehaupt)
[19:09:51] !log closing duplicate/wrong cluster indices in cloudelastic
[19:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:22] (PS1) Legoktm: [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192
[19:12:37] (CR) jenkins-bot: [V: -1] [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[19:15:36] (PS2) Legoktm: [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192
[19:16:03] (CR) jenkins-bot: [V: -1] [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[19:18:50] (PS3) Legoktm: [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192
[19:20:06] (CR) jenkins-bot: [V: -1] [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[19:23:47] SRE, Wikimedia-SVG-rendering, serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (Glrx) I (and presumably most of those seeking the font list) want the font list that is on the image scalers; that...
[19:24:27] (CR) Legoktm: "There's something wrong with my apache.conf.erb template, but I'm not seeing it." [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[19:27:28] (CR) Majavah: "maybe this?" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[19:29:13] SRE, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Ladsgroup) I just did more than 10K force unsubscriptions from all mailing lists. Hopefully this will help with the mess
[19:41:35] !log [apt1001:~] $ sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy - copy envoy package from buster to bullseye T280989
[19:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:45] T280989: try planet on bullseye - https://phabricator.wikimedia.org/T280989
[20:07:11] (CR) Legoktm: [WIP] lists: Move renamed lists into hiera (2 comments) [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[20:07:16] (CR) Chico Venancio: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[20:07:26] (PS4) Legoktm: [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192
[20:07:35] (CR) Chico Venancio: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[20:08:05] (CR) Chico Venancio: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[20:08:14] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29170/console" [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[20:12:51] (PS5) Legoktm: [WIP] lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192
[20:13:37] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29171/console" [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[20:15:21] !log [apt1001:~] $ sudo -i reprepro -C main includedeb bullseye-wikimedia /home/dzahn/rawdog_2.23-2_all.deb (T280989)
[20:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:29] T280989: try planet on bullseye - https://phabricator.wikimedia.org/T280989
[20:21:32] (PS6) Legoktm: lists: Move renamed lists into hiera [puppet] - https://gerrit.wikimedia.org/r/682192
[20:30:40] (CR) Dzahn: [C: +1] "compiler output (spot checked alias file, http redirects) looks good to me: https://puppet-compiler.wmflabs.org/compiler1003/29172/lists10" [puppet] - https://gerrit.wikimedia.org/r/682192 (owner: Legoktm)
[20:35:41] (CR) Dzahn: "Ideally if you can find a DBA to deploy the GRANT change" [puppet] - https://gerrit.wikimedia.org/r/682124 (owner: Aklapper)
[20:43:38] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup) oh I thought about just disabled mailing lists (not renamed ones) and preserving history. I agree with you on renamed ones too.
[20:47:28] PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2021-04-20 20:34:33 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[20:54:04] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3054.esams.wmnet, cp3058.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3054.esams.wmnet, cp3058.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3060.esams.wmnet, cp3054.esams.wmnet, cp3058.esams.wmnet are marked down but pooled: textlb6_443: Servers
[20:54:04] t, cp3054.esams.wmnet, cp3058.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:54:32] There seems to be a huge spike in response time right now
[20:54:48] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb6_443: Servers cp3058.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:56:28] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:57:14] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:58:37] errr
[20:58:51] hoo: via esams?
[20:59:03] Yes
[20:59:08] High load on varnish
[20:59:32] Seems better now
[20:59:42] https://grafana.wikimedia.org/d/000000608/datacenter-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=All
[20:59:48] (both the load and the response time)
[21:00:05] https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&var-server=cp3050&var-datasource=esams%20prometheus%2Fops
[21:01:11] Looks like a "sudden spike in incoming traffic"...
[21:03:01] SRE, observability, CAS-SSO: grafana-rw SSO redirect breaks template parameters due to double encoding - https://phabricator.wikimedia.org/T281004 (Krinkle)
[21:04:02] SRE, observability, CAS-SSO, Patch-For-Review, and 2 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (Krinkle) fyi: {T281004}
[21:04:19] legoktm: https://phabricator.wikimedia.org/T279809 wasn't so long back
[21:07:32] thanks
[21:11:49] legoktm: I can't actually see what's on it. I just remember it from the time.
[21:12:00] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:12:14] let me fix the perms on it
[21:13:06] done
[21:13:38] Thanks legoktm
[21:13:43] Hope it's useful
[21:13:53] (PS11) Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006)
[21:36:38] !log removing 1 file for legal compliance
[21:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:51] (PS2) Legoktm: lists: Fix mailman3 apache config [puppet] - https://gerrit.wikimedia.org/r/681785 (https://phabricator.wikimedia.org/T278612)
[22:07:25] SRE: try planet on bullseye - https://phabricator.wikimedia.org/T280989 (MoritzMuehlenhoff) We'll need to find a different aggregator, then: rawdog was removed from Bullseye since it was never ported to Python 3, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=938333
[22:08:00] (PS5) 01miki10: Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473)
[23:48:30] PROBLEM - snapshot of x1 in codfw on alert1001 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2021-04-20 23:40:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:53:24] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 279998 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops