[00:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210226T0000). [00:00:04] Jdlrobson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:20] i can deploy today [00:00:36] Jdlrobson: hey, around? [00:02:41] (03PS2) 10Dzahn: tcpircbot: remove deploy1001 from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T275831) [00:02:56] (03CR) 10jerkins-bot: [V: 04-1] tcpircbot: remove deploy1001 from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [00:02:58] Jdlrobson: ping? [00:03:21] (03PS3) 10Dzahn: tcpircbot: remove deploy1001 from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T275831) [00:03:37] (03PS2) 10Dzahn: common/scap/DHCP: remove deploy1001 from scap hosts and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/635111 (https://phabricator.wikimedia.org/T275831) [00:03:57] (03CR) 10jerkins-bot: [V: 04-1] common/scap/DHCP: remove deploy1001 from scap hosts and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/635111 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [00:04:00] Urbanecm: here [00:04:07] cool :) [00:04:12] sorry a bit late! [00:04:16] (03CR) 10Urbanecm: [C: 03+2] Do not log graph errors to WMF servers [extensions/Graph] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666999 (https://phabricator.wikimedia.org/T274557) (owner: 10Jdlrobson) [00:04:19] np :) [00:04:24] just a backport this time? 
[00:04:36] (03PS2) 10Dzahn: site: remove deploy1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/635112 (https://phabricator.wikimedia.org/T275831) [00:04:59] (03CR) 10jerkins-bot: [V: 04-1] site: remove deploy1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/635112 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [00:05:49] Urbanecm: yep [00:05:58] expecting to see some nice impact in logstash [00:06:11] i hope it doesn't mean "more entries" :D [00:07:10] haha no .. should hopefully cut out a big chunk of them [00:10:27] (03Merged) 10jenkins-bot: Do not log graph errors to WMF servers [extensions/Graph] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/666999 (https://phabricator.wikimedia.org/T274557) (owner: 10Jdlrobson) [00:10:58] \o [00:11:05] (03PS3) 10Dzahn: package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) [00:11:30] (03CR) 10jerkins-bot: [V: 04-1] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [00:11:38] Jdlrobson: can you test at mwdebug1001, please? [00:11:50] Urbanecm: on it [00:12:16] Urbanecm: that did it! [00:12:21] cool! [00:12:21] please sync away [00:12:23] syncing [00:14:19] *dances* [00:14:30] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.32/extensions/Graph/: 9d5cf348f5dda32f8889d5160bb1fe34a4e07f8c: Do not log graph errors to WMF servers (T274557) (duration: 01m 36s) [00:14:32] Jdlrobson: it's live, enjoy :) [00:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:36] anything else? [00:14:39] T274557: Broken graphs on wikis accounting for 45% of our client side errors - https://phabricator.wikimedia.org/T274557 [00:15:46] nope that's it. thanks Urbanecm [00:15:50] any time [00:15:51] now i just watch the graphs.... 
[00:16:36] (03PS4) 10Dzahn: package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) [00:16:40] hehe [00:17:17] (03PS2) 10Dzahn: switch deployment CNAME from deploy1001 to deploy1002 [dns] - 10https://gerrit.wikimedia.org/r/635113 (https://phabricator.wikimedia.org/T265963) [00:26:29] (03PS3) 10Dzahn: hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) [00:33:05] (03PS1) 10Dzahn: DHCP: remove deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/667040 (https://phabricator.wikimedia.org/T275832) [00:33:07] (03PS1) 10Dzahn: hiera/scap: remove deploy2001 from firewalls and dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/667041 (https://phabricator.wikimedia.org/T275832) [00:33:09] (03PS1) 10Dzahn: tcpircbot: remove deploy2001 [puppet] - 10https://gerrit.wikimedia.org/r/667042 (https://phabricator.wikimedia.org/T275832) [00:39:35] (03PS1) 10Dzahn: scap: switch codfw deployment server and scap master to deploy2002 [puppet] - 10https://gerrit.wikimedia.org/r/667043 (https://phabricator.wikimedia.org/T265963) [00:41:36] (03CR) 10Dzahn: "Let's start with this one" [puppet] - 10https://gerrit.wikimedia.org/r/667043 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [00:43:46] (03CR) 10Dzahn: "I already ran "scap-sync-master" manually meanwhile, so it's synced. 
But this would still make it automatic without a human having to run " [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [01:16:42] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/667045 [01:46:28] (03PS1) 10Legoktm: [WIP] Add shellbox chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 [02:19:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:22:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:51:50] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1008 is OK: (C)5e+06 ge (W)1e+06 ge 5.452e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [03:01:24] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1001 is OK: (C)5e+06 ge (W)1e+06 ge 7.249e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [03:08:44] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1007 is OK: (C)5e+06 ge (W)1e+06 ge 2.271e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007 [03:17:38] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:19:44] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:47:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (10RKemper) @Cmjohnson The data reload is complete on `wdqs1009`, so the host can now have its firmware upgraded and be rebooted at its convenience. Note this is an internal wdqs... [04:21:24] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-reload [04:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:31] !log T267927 [WDQS Data Reload] `sudo -i cookbook sre.wdqs.data-reload wdqs2008.codfw.wmnet --task-id T267927 --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --reuse-downloaded-dump --depool` on `ryankemper@cumin2001` tmux session `wdqs_data_reload_2008` [04:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:39] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [04:53:22] (03CR) 10Ryan Kemper: [C: 03+2] Don't drop namespace if it's already gone [puppet] - 10https://gerrit.wikimedia.org/r/666610 (https://phabricator.wikimedia.org/T269331) (owner: 10ZPapierski) [04:53:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:54:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:54:43] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "Great explanation in the commit body, made this super simple to review" [puppet] - 10https://gerrit.wikimedia.org/r/666610 (https://phabricator.wikimedia.org/T269331) (owner: 10ZPapierski) [04:57:58] (03PS1) 10Ryan Kemper: wdqs: improve replaceNamespace log output [puppet] - 10https://gerrit.wikimedia.org/r/667054 (https://phabricator.wikimedia.org/T269331) [05:02:05] (03CR) 10Ryan Kemper: "Optional patch to explicitly mention we're deleting an old namespace. It might be that things are already clear enough and the log message" [puppet] - 10https://gerrit.wikimedia.org/r/667054 (https://phabricator.wikimedia.org/T269331) (owner: 10Ryan Kemper) [05:05:43] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2045.codfw.wmnet ` The log can be found in `/v... [05:07:33] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:51] !log T275345 `sudo -i wmf-auto-reimage-host --conftool -p T275345 elastic2045.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic1065` [05:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:57] T275345: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 [05:13:11] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:47] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 9 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28267/console" [puppet] - 10https://gerrit.wikimedia.org/r/666775 (owner: 10Ebernhardson) [05:25:14] !log [relforge] Downtimed `relforge1004` until `2021-03-02 07:23:36` (https://phabricator.wikimedia.org/T275658 is in flight to fix broken `kibana.service`) [05:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:39] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] Open relforge elasticsearch tls ports to analytics network [puppet] - 10https://gerrit.wikimedia.org/r/666775 (owner: 10Ebernhardson) [05:27:43] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2045.codfw.wmnet with reason: REIMAGE [05:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:46] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2045.codfw.wmnet with reason: REIMAGE [05:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:31] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10RKemper) Side note: Just noticed I named the tmux session `elastic1065`. Fortunately as can be seen above we're reimaging the proper host, `elastic2045` :P [05:36:58] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2045.codfw.wmnet'] ` and were **ALL** successful. [05:59:33] Patch deployed in wmf.32 before it reached to deployment servers is not reflecting in Production. [05:59:40] What can be reason? 
[06:00:01] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/ContentTranslation/+/refs/heads/wmf/1.36.0-wmf.32 is correct but Special:Version shows older one. [06:01:10] Can anyone help in syncing ContentTranslation or should I do that? [06:14:44] ^ Missing patch is: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/666327 <-- Urbanecm [06:17:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169 to clone db1134 T275343', diff saved to https://phabricator.wikimedia.org/P14490 and previous config saved to /var/cache/conftool/dbconfig/20210226-061705-marostegui.json [06:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:14] T275343: Reimage db1134 to Buster and repool it - https://phabricator.wikimedia.org/T275343 [06:25:07] OK. I'll do sync and see. No other updates needed except ContentTranslation sync. [06:32:49] !log kartik@deploy1001 Synchronized php-1.36.0-wmf.32/extensions/ContentTranslation: Resync ContentTranslation for {{gerrit|666327}} (duration: 01m 16s) [06:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:02] OK. That didn't work. Patch is lost in between somewhere. Even though it is merged :/ [06:35:53] kart_: was the submodule bumped? [06:36:25] legoktm: yep [06:36:51] legoktm: Can you check where patch went? I can't even cherry-pick again. 
[06:36:51] * legoktm logs in [06:37:43] (03PS1) 10Marostegui: instances.yaml: Remove db1092 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/667056 (https://phabricator.wikimedia.org/T275019) [06:38:38] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1092 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/667056 (https://phabricator.wikimedia.org/T275019) (owner: 10Marostegui) [06:38:51] ah [06:38:56] the submodule pointer is at cd5cd3c9d2229417000bb2732093ffff9d887bc7 [06:39:11] the patch you're looking for is e6b1a7cd0d4e8e0329ee40faee256f26597def43 [06:39:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1092 from dbctl T275019', diff saved to https://phabricator.wikimedia.org/P14492 and previous config saved to /var/cache/conftool/dbconfig/20210226-063914-marostegui.json [06:39:18] so why didn't it bump automatically? [06:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:21] T275019: decommission db1092.eqiad.wmnet - https://phabricator.wikimedia.org/T275019 [06:40:25] kart_: I think the patch was cherry-picked too early? [06:40:35] anyways, we just need to manually bump the submodule [06:40:42] (03CR) 10Elukey: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767) (owner: 10Razzi) [06:41:01] legoktm: yes. Probably that's reason. [06:41:14] legoktm: Can you help to bump the submodule? 
[06:41:27] yeah, give me a minute [06:44:06] * legoktm is waiting on `git pull` to finish [06:47:05] 10SRE, 10vm-requests, 10Patch-For-Review: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster - https://phabricator.wikimedia.org/T275630 (10elukey) 05Open→03Resolved a:03elukey VMs created :) [06:49:00] (03PS1) 10Legoktm: Bump ContentTranslation [core] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667057 [06:49:12] (03CR) 10Legoktm: [C: 03+2] Bump ContentTranslation [core] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667057 (owner: 10Legoktm) [06:49:22] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Bump ContentTranslation [core] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667057 (owner: 10Legoktm) [06:50:41] for reference, this manual bump process is explained at https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_submodule :) [06:50:56] and np :) [06:53:19] ah. :P [06:53:29] I was just lost where the patch was lost! [06:53:42] Syncing.. [06:53:57] !log kartik@deploy1001 Synchronized php-1.36.0-wmf.32/extensions/ContentTranslation: Bump ContentTranslation to e6b1a7c to include lost {{gerrit|666327}} backport (duration: 00m 58s) [06:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:52] Thanks a lot legoktm, things seems working :) [06:56:43] wooo [06:57:38] * kart_ notes not to cherry-pick too early for upcoming branch. [06:57:40] kart_: (if I did timezone conversion right) so I think the problem was that the ContentTranslation bump was merged before https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/c44158ea6983ab16fe28acefc09cf44baae829d3 landed, which meant Gerrit didn't know to update the submodule [06:58:32] Yeah. So, it was cherry-picked and merged and couldn't deploy because wmf.32 wasn't there in deployment server.
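Editorial sketch, not part of the original log: the diagnosis above came down to comparing the commit recorded as the submodule pointer with the commit of the missing patch. A minimal Python illustration of that comparison follows. It assumes the pointer is read from one line of `git ls-tree` output, where a submodule appears with mode `160000` and type `commit`. The helper names `gitlink_sha` and `pointer_matches` are hypothetical, and the submodule path in the sample line is inferred from the sync messages rather than quoted from a real `git ls-tree` run. The two SHAs are the ones quoted in the log.

```python
def gitlink_sha(ls_tree_line: str) -> str:
    """Extract the commit SHA from one `git ls-tree` line describing a submodule.

    Expected shape: "<mode> <type> <sha>\t<path>", with mode 160000 for a gitlink.
    """
    meta, _path = ls_tree_line.rstrip("\n").split("\t", 1)
    mode, obj_type, sha = meta.split()
    if mode != "160000" or obj_type != "commit":
        raise ValueError("not a submodule (gitlink) entry")
    return sha


def pointer_matches(ls_tree_line: str, wanted: str) -> bool:
    """Return True if the recorded pointer equals `wanted` (full or abbreviated SHA).

    This is an equality check only. It does not test ancestry, so a pointer
    at a later commit that already contains the patch would report False.
    """
    return gitlink_sha(ls_tree_line).startswith(wanted.lower())


# Sample line built from the SHAs quoted in the log; the path is an assumption.
before_bump = (
    "160000 commit cd5cd3c9d2229417000bb2732093ffff9d887bc7"
    "\textensions/ContentTranslation"
)

# The pointer was still at cd5cd3c, so the wanted commit e6b1a7c is not matched.
print(pointer_matches(before_bump, "e6b1a7c"))
```

A result of `False` here corresponds to the situation in the log: the patch was merged in the extension repository, but the core branch still pointed at the older commit, so a manual submodule bump was needed.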
[06:59:17] PROBLEM - Hadoop DataNode on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [06:59:39] !log restart datanode on an-worker1099 - soft lockup kernel errors [06:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:38] !log reboot an-worker1099 to clear out kernel soft lockup errors [07:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:52] !log Stop MySQL on db2106 to clone db2147 T275633 [07:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:59] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:05:49] PROBLEM - Host an-worker1099 is DOWN: PING CRITICAL - Packet loss = 100% [07:06:35] RECOVERY - Host an-worker1099 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [07:08:01] RECOVERY - Hadoop DataNode on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [07:14:25] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10elukey) [07:14:27] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10elukey) [07:14:55] (03CR) 10Elukey: [C: 04-1] "Trying to fix my mess in https://phabricator.wikimedia.org/T260445#6863444, let's wait to merge this, sorry :(" [puppet] - 10https://gerrit.wikimedia.org/r/667032 (https://phabricator.wikimedia.org/T275767) (owner: 10Razzi) [07:15:02] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) 05Resolved→03Open @wiki_willy I am
terribly sorry to re-open this task, please be patient, but I discovered that I made an error (got fooled by... [07:30:47] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:49:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:52:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14494 and previous config saved to /var/cache/conftool/dbconfig/20210226-075219-root.json [07:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:04] (03PS1) 10Marostegui: mariadb: Productionize db2147 [puppet] - 10https://gerrit.wikimedia.org/r/667079 (https://phabricator.wikimedia.org/T275633) [07:57:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2147 [puppet] - 10https://gerrit.wikimedia.org/r/667079 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210226T0800) [08:01:41] (03PS1) 10Muehlenhoff: profile::zuul::server: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/667102 [08:03:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [08:04:43] !log run ipmi mc reset cold for analytics1058 - mgmt responding to pings and ipmi, but not to ssh [08:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14495 and previous config saved to /var/cache/conftool/dbconfig/20210226-080722-root.json [08:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:08] (03PS1) 10Marostegui: install_server: Do not reimage db2147 [puppet] - 10https://gerrit.wikimedia.org/r/667104 (https://phabricator.wikimedia.org/T275633) [08:13:04] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2147 [puppet] - 10https://gerrit.wikimedia.org/r/667104 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:14:23] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 79 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:17:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1058.eqiad.wmnet with reason: REIMAGE [08:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:52] (03PS1) 10Gerrit maintenance bot: Add tay to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/667105 (https://phabricator.wikimedia.org/T275803) [08:19:35] !log elukey@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1058.eqiad.wmnet with reason: REIMAGE [08:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14496 and previous config saved to /var/cache/conftool/dbconfig/20210226-082226-root.json [08:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:47] (03CR) 10Jbond: "Is there more info or a task on the error?" [puppet] - 10https://gerrit.wikimedia.org/r/666989 (owner: 10Dzahn) [08:25:38] (03PS2) 10Volans: code style: improve doc and link doc from tox [software/spicerack] - 10https://gerrit.wikimedia.org/r/666934 [08:26:11] (03CR) 10Volans: "addressed comment" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/666934 (owner: 10Volans) [08:28:00] (03PS4) 10Jbond: admin: add ops to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) [08:28:08] (03CR) 10Jbond: [V: 03+1] "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond) [08:28:18] (03CR) 10David Caro: [C: 03+1] code style: improve doc and link doc from tox [software/spicerack] - 10https://gerrit.wikimedia.org/r/666934 (owner: 10Volans) [08:28:22] (03PS5) 10Jbond: admin: add contint-roots to contint-admins using a yaml reference [puppet] - 10https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) [08:28:24] (03CR) 10Filippo Giunchedi: [C: 03+1] Expand range of Modify Kafka max replica lag slope alert [puppet] - 10https://gerrit.wikimedia.org/r/666966 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [08:29:02] (03CR) 10Elukey: [C: 03+1] Expand range of Modify Kafka max replica lag slope alert [puppet] - 
10https://gerrit.wikimedia.org/r/666966 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [08:29:15] (03CR) 10Elukey: [C: 03+2] Expand range of Modify Kafka max replica lag slope alert [puppet] - 10https://gerrit.wikimedia.org/r/666966 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [08:30:37] (03CR) 10Filippo Giunchedi: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666677 (https://phabricator.wikimedia.org/T275658) (owner: 10Ryan Kemper) [08:30:41] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: use different settings based off version [puppet] - 10https://gerrit.wikimedia.org/r/666677 (https://phabricator.wikimedia.org/T275658) (owner: 10Ryan Kemper) [08:31:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:35:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [08:37:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 15%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14497 and previous config saved to /var/cache/conftool/dbconfig/20210226-083729-root.json [08:37:32] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond) [08:42:11] (03CR) 10Jbond: [C: 03+1] modules/sudo: Add TMUX variable to kept env vars [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [08:42:45] (03CR) 10Jbond: [C: 
03+2] admin: add ops to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/666940 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond) [08:42:51] (03CR) 10Jbond: [C: 03+2] admin: add contint-roots to contint-admins using a yaml reference [puppet] - 10https://gerrit.wikimedia.org/r/666941 (https://phabricator.wikimedia.org/T275731) (owner: 10Jbond) [08:44:50] (03CR) 10Muehlenhoff: O:idp: add netbox as an authorised servie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [08:46:52] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (10jbond) This has now been added i have included the full list of the contin-admins... [08:47:30] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (10jbond) 05Open→03Resolved a:03jbond [08:49:38] (03PS1) 10Jbond: O:idp: fix service pattern match for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667109 (https://phabricator.wikimedia.org/T244849) [08:50:11] (03CR) 10Klausman: [C: 03+2] modules/sudo: Add TMUX variable to kept env vars [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [08:50:14] (03PS4) 10Klausman: modules/sudo: Add TMUX variable to kept env vars [puppet] - 10https://gerrit.wikimedia.org/r/666899 [08:50:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667043 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [08:50:29] (03CR) 10Klausman: [V: 03+2 C: 03+2] modules/sudo: Add TMUX variable to kept env vars [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [08:50:31] (03CR) 10Jbond: [V: 03+1] "PCC
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28268/console" [puppet] - 10https://gerrit.wikimedia.org/r/667109 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [08:51:35] (03CR) 10Jbond: O:idp: add netbox as an authorised servie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666957 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [08:51:56] (03CR) 10Jbond: O:idp: fix service pattern match for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667109 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [08:52:23] klausman: think you may need to revert that change [08:52:28] why? [08:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14498 and previous config saved to /var/cache/conftool/dbconfig/20210226-085233-root.json [08:52:38] db1107 : Feb 26 08:51:16 : jynus : parse error in /etc/sudoers near line 6 ; TTY=pts/0 ; PWD=/home/jynus ; USER=root ; [08:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:42] see email [08:52:49] dangit [08:53:10] let's kill puppet? [08:53:23] doing now [08:53:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667109 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [08:55:27] !log disabled puppet pending rollback of https://gerrit.wikimedia.org/r/666899 [08:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:48] puppet is disabled [08:56:19] (03PS1) 10Klausman: sudoers: fix broken env list [puppet] - 10https://gerrit.wikimedia.org/r/667110 [08:56:32] We should add a visudo -c check to this particular file [08:57:19] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667110 (owner: 10Klausman) [08:57:44] klausman: ack will create a CR in a sec [08:57:54] will you be able to deploy? 
[08:58:27] Probably not. Currently waiting for jerkins [08:58:35] (03CR) 10Klausman: [C: 03+2] sudoers: fix broken env list [puppet] - 10https://gerrit.wikimedia.org/r/667110 (owner: 10Klausman) [08:58:59] Indeed I can't [08:59:15] Oh hang on. [08:59:17] why not? [08:59:44] act one sec let me try and fix [08:59:46] deploying [09:00:07] My brain blanked for a minute and I forgot I have access to the root pw [09:00:38] and the local puppet run is triggered from a cron which runs a root, so that'll also work fine [09:02:08] * jbond42 was going to run `cumin 'a:puppetmaster' 'sed -i '/env_keep/d' /etc/sudoers' 'puppet agent -t' [09:02:34] jbond42: batch! or will kill the puppetmasters [09:02:52] (03PS1) 10Filippo Giunchedi: rsyslog: give 'ops' group access to centrallog files [puppet] - 10https://gerrit.wikimedia.org/r/667112 (https://phabricator.wikimedia.org/T254605) [09:03:25] So all that is needed is re-enabling Puppet? [09:04:41] klausman: did yu fix it via root? i have not run anything yet but propose the following if not https://phabricator.wikimedia.org/P14499 [09:04:59] I fixed it as root, yes. [09:05:05] su -, root pw, puppet-merge [09:05:06] ack ill enable again then [09:06:07] klausman, was a revert/patch deployed, at what time? [09:06:13] lets test a single host first? [09:06:18] yes [09:06:35] i tested on cumin1001 and worked [09:07:07] I also checked the fix-patch with visudo -c [09:07:08] my worry is scheduled jobs depending on sudo [09:07:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14500 and previous config saved to /var/cache/conftool/dbconfig/20210226-090736-root.json [09:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] jbond42: can't we just renable puppet + puppet run fleetwide via cumin? 
[09:07:50] let's not delay deploy a lot after verification [09:08:00] now that we have a fix [09:08:08] I just tested it and it worked for analytics1058 [09:08:11] The file is on cumin1001 and works fin [09:08:14] fine* [09:08:15] so lets reenable [09:08:25] if 2 people say it is fixed [09:08:27] ? [09:08:47] jynus: moritzm: the enabled had allready kicked in when you asked to hold of [09:08:53] thanks [09:09:11] sudo cumin '*' 'run-puppet-agent --failed-only ' also run (checking output) [09:09:47] jbond42: with batching right?? [09:09:51] jbond42: puppet didn't failed on the hosts AFAIK [09:09:51] Sorry about all this, guys. I should've read the man page more closely and/or run visudo -c on the original change. [09:09:52] (sorry to ask just to be sure) [09:09:59] so that will probably be uwseless [09:10:08] yes I agree [09:10:13] (the --failed-only party) [09:10:15] it needs to be fleet wide [09:10:19] and yes BATCH!!! [09:10:21] but with batch [09:10:25] -b 10 [09:10:27] or 15 [09:10:29] ahh of course (was wondering why it ran so quick) [09:10:29] not more [09:10:32] otherwise volans is right, we kill the puppet masters [09:10:49] ack will run with -b15 fleet wide and no failed-hosts [09:11:06] sudo cumin -b 15 '*' 'run-puppet-agent ' [09:11:14] jbond42: -q [09:11:19] you don't want the output probably [09:11:23] ack, sounds good [09:11:28] ack thanks anything elses before i hit enter [09:11:51] no [09:11:56] can't think of anything [09:11:56] ack going for it [09:12:17] for gmail filter to delete them you can use: [09:12:18] from:(nagios@) SECURITY after:2021-02-25 [09:12:32] jbond42, ongoing now? 
[09:12:45] yes its running now [09:12:58] super thanks jbond42 [09:13:09] !log pupet enabled post sudoers fix, running puppet fleet wide with cumin -b 15 '*' 'run-puppet-agent ' [09:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] I am documenting it as IC [09:13:20] (03CR) 10Ladsgroup: [C: 03+1] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [09:13:25] thanks jynus [09:14:14] meanwhile, if people can check random errors that could be happening due to sudo disabled? [09:14:36] health of things like jobqueue/appservers, analytics hosts, etc. [09:14:41] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Marostegui) Reminder: do not add IPV6 entries for DB hosts. [09:15:16] jynus: checking at random is a bit difficult, all the sudo usages should be in timers or crons or similar things that should fail, in theory, alerting us [09:15:33] I know it is vage suggestion [09:15:49] it is more like "be aware/vigilant" [09:15:58] let's pay attention to the cron spam from root@ and icinga :) [09:17:12] jbond42: qq - I keep seeing a flood of security emails, is it due to hosts without the fix? [09:17:34] yes I think so, just checked elastic1059 [09:17:34] yeah, that's caused by the sudoers parse error [09:17:57] "CRITICAL: 4363 mails in exim queue." 
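The batching concern raised between 09:09:47 and 09:11:06 (run fleet wide with `-b 15` rather than on every host at once) amounts to capping how many hosts execute at the same time. A minimal sketch of that idea, assuming a hypothetical `action` callable that runs the command on one host and returns its exit code. It does not reproduce Cumin's implementation or API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_batched(hosts: Iterable[str], action: Callable[[str], int],
                batch_size: int = 15) -> Dict[str, int]:
    """Run action on every host with at most batch_size running at once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    host_list = list(hosts)
    # The pool size is the cap: a new host starts only when a slot frees up.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        exit_codes = list(pool.map(action, host_list))
    return dict(zip(host_list, exit_codes))
```

With a cap of 15, whatever the hosts share (here, the puppetmasters) sees at most 15 concurrent clients regardless of fleet size, at the cost of a longer total run.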
[09:18:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:18:40] that's the only thing I see on icinga that is probably related [09:18:56] so it may take some time until we receive other errors [09:19:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:00] (03PS1) 10Jbond: sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667119 [09:21:02] (03PS1) 10Jbond: sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 [09:21:45] thanks jbond42 --^ [09:22:33] (03CR) 10Kormat: "Q: Do we actually know the implications of exposing $TMUX to a different user (root, in this case)?" [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [09:22:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14501 and previous config saved to /var/cache/conftool/dbconfig/20210226-092240-root.json [09:22:43] I certainly don't want to test the theory that running puppet at the same time across the fleet will kill the puppetmasters, though we fixed the specific problem of puppetmaster running out of memory a couple of years back [09:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:46] (03CR) 10Klausman: [C: 03+1] sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [09:23:21] is there a percentage update on puppet run? 
[09:23:26] 10% [09:23:28] Man, I made Friday morning way more exciting that I had any intention to. [09:24:01] (03CR) 10Ladsgroup: phabricator::tools: replace cron jobs with timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [09:24:17] !log root@cumin1001 START - Cookbook sre.dns.netbox [09:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:26:50] (03CR) 10Kormat: sudo: add validate_cmd for sudoers file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [09:27:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:28:03] 505 unknown checks pending [09:28:08] 700+ unread emails, that brings me back to corporate days :) [09:28:46] !log root@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:28:51] (03CR) 10Kormat: sudo: add validate_cmd for sudoers file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [09:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:12] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10aborrero) a:05Jclark-ctr→03MoritzMuehlenhoff I don't see errors anymore on cloudnet1004 after the kernel upgrade. I thin... 
[09:31:12] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Marostegui) @Papaul reminder for next iterations, please do not add ipv6 entries for DB hosts (T270101) I have already removed them from netbox Thanks! [09:32:23] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [09:32:25] 20% [09:32:31] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 11794 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [09:32:52] !bash 09:28:08 700+ unread emails, that brings me back to corporate days :) [09:32:52] Majavah: Stored quip at https://bash.toolforge.org/quip/vyyv3XcBpU87LSFJDOgx [09:33:04] !log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2001-dev.wikimedia.org [09:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:30] so far I only seen nagios and 2 manual runs fail, nothing else (that is good) [09:35:44] (03PS2) 10Jbond: sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667119 [09:36:46] (03PS1) 10Vgutierrez: ATS: Enable parent proxies on ats-tls at upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/667121 (https://phabricator.wikimedia.org/T274888) [09:36:59] (03PS2) 10Jbond: sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 [09:37:26] sobanski: the great exchange massacre of 2008? 
:) [09:37:34] (03CR) 10Jbond: sudo: add validate_cmd for sudoers file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [09:37:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 65%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14502 and previous config saved to /var/cache/conftool/dbconfig/20210226-093743-root.json [09:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:51] 30% [09:38:16] (03PS3) 10Jbond: sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 [09:38:50] godog: no, every Monday [09:39:02] hey, jbond, I don't need every 10%, just give a heads up on, eg. 50 and completion [09:39:15] jynus: ack :) [09:39:21] aprox [09:39:25] sobanski: hehe easy to believe [09:39:58] btw I added an action item to the incident to understand how much of a batch we can use for fleetwide puppet runs [09:40:57] godog: yes i think you are right fyi i have definetly run it with larger batch sizes without issue but not confident what that size was [09:41:15] !log aborrero@cumin2001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcontrol2001-dev.wikimedia.org [09:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:01] jbond42: yeah, I remember fixing passenger's pool size exactly to fix the recurrent issue where we'd accidentally kill the puppet masters, hasn't happened ever since iirc [09:42:48] (03PS1) 10Elukey: Add specific settings for Hadoop workers on Buster with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/667122 (https://phabricator.wikimedia.org/T231067) [09:43:24] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2001-dev.codfw.wmnet [09:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:15] (03PS2) 10Elukey: Add specific settings for Hadoop workers on Buster with 
GPUs [puppet] - 10https://gerrit.wikimedia.org/r/667122 (https://phabricator.wikimedia.org/T231067) [09:44:16] godog: ack we can do some tests next week [09:44:25] PROBLEM - exim queue on mx2001 is CRITICAL: CRITICAL: 5483 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [09:44:41] ^^^ thats gone down [09:45:00] or was it mx1001 before [09:48:47] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2001-dev.codfw.wmnet [09:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:35] also, that was probably from a time when palladium was a single server, right? we even extended to the second backend server in the mean time [09:49:59] (03CR) 10Jbond: "Just making a note here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/180506/2/modules/sudo/files/sudoers is where HOME was first " [puppet] - 10https://gerrit.wikimedia.org/r/666899 (owner: 10Klausman) [09:50:30] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2001-dev.codfw.wmnet [09:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:36] !log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudservices2002-dev.wikimedia.org [09:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:31] fyi its at 50% now (have updated the IR) [09:52:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14503 and previous config saved to /var/cache/conftool/dbconfig/20210226-095247-root.json [09:52:47] (03CR) 10Jbond: [C: 03+2] O:idp: fix service pattern match for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667109 (https://phabricator.wikimedia.org/T244849) (owner: 10Jbond) [09:52:47] !log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudservices2003-dev.wikimedia.org 
[09:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={cloud_dev_pdns,cloud_dev_pdns_rec} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:54:21] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2001-dev.codfw.wmnet [09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:47] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2002-dev.wikimedia.org [09:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:58:46] we are down to a few dozens unknowns [09:59:13] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2003-dev.wikimedia.org [09:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:41] !log aborrero@cumin2001 START - Cookbook sre.hosts.reboot-single for host cloudweb2001-dev.wikimedia.org [09:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] (03PS2) 10Filippo Giunchedi: rsyslog: give 'ops' group access to centrallog files [puppet] - 10https://gerrit.wikimedia.org/r/667112 (https://phabricator.wikimedia.org/T254605) [10:05:55] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2002-dev.codfw.wmnet [10:06:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:41] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2001-dev.wikimedia.org [10:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:29] 10SRE, 10Security: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (10jbond) [10:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 85%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14504 and previous config saved to /var/cache/conftool/dbconfig/20210226-100750-root.json [10:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:01] kormat: fyi ^^^ https://phabricator.wikimedia.org/T275852 [10:08:37] jbond42: 👍. i'll add info about issues with $HOME [10:08:46] great thanks [10:09:58] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2002-dev.codfw.wmnet [10:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:16] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2003-dev.codfw.wmnet [10:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:47] 10SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (10fgiunchedi) [10:16:37] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE [10:16:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015 (owner: 10JMeybohm) [10:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] (03Merged) 10jenkins-bot: event*: Enable egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015 (owner: 
10JMeybohm) [10:18:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE [10:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:33] (03CR) 10Kormat: [C: 03+1] sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [10:22:45] jbond42: huh, that's odd. `/sbin/visudo` exists on cumin1001, but it's not owned by any debian package [10:22:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repool db1169 after cloning db1134', diff saved to https://phabricator.wikimedia.org/P14505 and previous config saved to /var/cache/conftool/dbconfig/20210226-102254-root.json [10:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:05] it's idental to /usr/sbin/visudo, which _is_ from a debian package (`sudo`) [10:23:31] ohh. doh. /sbin is symlinked to /usr/sbin. [10:23:40] nvm :) [10:23:52] kormat: its debian busterthing which ln ... yes :) [10:24:28] (03CR) 10Klausman: [C: 03+1] sudo: add validate_cmd for sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/667120 (owner: 10Jbond) [10:25:18] 10SRE, 10Security: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (10Kormat) An example of where passing $HOME caused issues for me: https://gerrit.wikimedia.org/r/c/operations/software/wmfmariadbpy/+/631717 Effectively, anything that assumes $HOME wi... [10:25:58] kormat: https://wiki.debian.org/UsrMerge [10:26:11] jbond42: aye. 
i ran into it on linux mint a few weeks back [10:26:21] ack [10:26:23] didn't realise debian had already done it [10:26:35] (and that dpkg -S didn't support it) [10:27:05] only from buster and io would have to summon moritzm for more details :) [10:27:13] no worries, thanks :) [10:28:49] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2003-dev.codfw.wmnet [10:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] 10SRE, 10Security: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (10Kormat) Re: passing through $TMUX, i'm not concerned about //security// issues, i'm just slightly concerned that it's going to cause.. unexpected outcomes. Like $HOME, it's non-stand... [10:30:53] (03CR) 10Ayounsi: [C: 03+1] check_puppetrun: go critical puppet is disabled for more then a week [puppet] - 10https://gerrit.wikimedia.org/r/666950 (owner: 10Jbond) [10:31:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [10:31:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . [10:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:38] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:31:39] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:57] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
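The `/sbin/visudo` puzzle from 10:22:45 to 10:23:31 came down to `/sbin` being a symlink to `/usr/sbin`. A small sketch, using a hypothetical helper name, of checking whether two paths are the same file once symlinks are resolved:

```python
import os


def resolves_to_same_file(first: str, second: str) -> bool:
    """Return True when both paths exist and point at the same file."""
    try:
        return os.path.samefile(first, second)
    except OSError:
        return False  # at least one path does not exist
```

On a merged-/usr host, calling this with `/sbin/visudo` and `/usr/sbin/visudo` would be expected to return True, which is consistent with the two files looking identical while only the `/usr/sbin` path is known to the package database.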
[10:31:58] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [10:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:32:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [10:32:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [10:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:01] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:33:01] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [10:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:16] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [10:33:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[10:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . [10:33:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [10:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [10:33:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [10:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:24] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2001-dev.codfw.wmnet [10:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:13] (03PS1) 10Muehlenhoff: debmonitor: Bump the uwsgi buffer size to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/667132 (https://phabricator.wikimedia.org/T275599) [10:36:31] (03PS1) 10Alexandros Kosiaris: apertium: Remove old plain release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/667133 [10:36:33] (03PS1) 10Alexandros Kosiaris: sessionstore: Enable egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667134 [10:38:02] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1039.eqiad.wmnet [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:21] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host cloudcephmon2001-dev.codfw.wmnet [10:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Remove old plain release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/667133 (owner: 10Alexandros Kosiaris) [10:38:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] sessionstore: Enable egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667134 (owner: 10Alexandros Kosiaris) [10:38:54] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2002-dev.codfw.wmnet [10:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:05] kormat, jbond42: yeah, Buster and later does this by default in debootstrap (which is what the debian-installer uses). some details https://wiki.debian.org/UsrMerge [10:39:22] (03Merged) 10jenkins-bot: apertium: Remove old plain release values [deployment-charts] - 10https://gerrit.wikimedia.org/r/667133 (owner: 10Alexandros Kosiaris) [10:39:27] (03Merged) 10jenkins-bot: sessionstore: Enable egress networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667134 (owner: 10Alexandros Kosiaris) [10:39:35] there was some disagreement on the specifics of the migration (which is also the reason why it's poorly represented in dpkg) [10:39:44] jynus: fyi finished [10:39:48] thanks [10:39:59] exim stills struggling [10:40:04] but eventually the Debian TechCom rules that this is the way to proceed https://wiki.debian.org/UsrMerge [10:40:05] so I won't close incident for now [10:40:19] cheers moritzm [10:41:54] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2002-dev.codfw.wmnet [10:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:29] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2003-dev.codfw.wmnet [10:43:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [10:44:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [10:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:27] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1039.eqiad.wmnet [10:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [10:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:03] (03CR) 10Vgutierrez: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/28269/" [puppet] - 10https://gerrit.wikimedia.org/r/667121 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [10:49:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:50:11] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2003-dev.codfw.wmnet [10:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:32] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2003-dev.codfw.wmnet [10:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:28] (03PS1) 10Alexandros Kosiaris: mobileapps: Enable egress network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667138 [10:54:00] 
(03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/667132 (https://phabricator.wikimedia.org/T275599) (owner: 10Muehlenhoff) [10:54:22] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2003-dev.codfw.wmnet [10:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:53] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666677 (https://phabricator.wikimedia.org/T275658) (owner: 10Ryan Kemper) [10:55:43] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2002-dev.codfw.wmnet [10:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:26] (03PS1) 10Alexandros Kosiaris: push-notifications: Switch networkpolicy to true [deployment-charts] - 10https://gerrit.wikimedia.org/r/667140 [10:59:36] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2002-dev.codfw.wmnet [10:59:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Enable egress network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667138 (owner: 10Alexandros Kosiaris) [10:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] push-notifications: Switch networkpolicy to true [deployment-charts] - 10https://gerrit.wikimedia.org/r/667140 (owner: 10Alexandros Kosiaris) [11:00:21] (03Merged) 10jenkins-bot: mobileapps: Enable egress network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/667138 (owner: 10Alexandros Kosiaris) [11:00:29] (03Merged) 10jenkins-bot: push-notifications: Switch networkpolicy to true [deployment-charts] - 10https://gerrit.wikimedia.org/r/667140 (owner: 10Alexandros Kosiaris) [11:00:42] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:00:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:04] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [11:02:04] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . [11:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [11:02:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [11:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:51] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [11:02:51] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [11:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:57] (03CR) 10Elukey: [C: 03+2] Add specific settings for Hadoop workers on Buster with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/667122 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [11:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [11:03:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[11:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:26] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [11:03:27] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [11:03:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'production' . [11:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [11:04:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' . [11:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:39] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [11:04:39] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[11:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [11:05:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [11:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:13] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:53] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:13] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/679/" [puppet] - 10https://gerrit.wikimedia.org/r/667132 (https://phabricator.wikimedia.org/T275599) (owner: 10Muehlenhoff) [11:10:24] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:05] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:10] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable parent proxies on ats-tls at upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/667121 (https://phabricator.wikimedia.org/T274888) (owner: 10Vgutierrez) [11:15:23] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[11:15:23] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [11:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [11:15:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [11:16:06] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10OlyKalinichenkoSpeedAndFunction) Please add new key ` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCrWAH/rao/XRJ6VjjsFnr141AwNsatGcSMQ5WvR9LhNUwNFnUnAXr... [11:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:16:35] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [11:16:35] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [11:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [11:17:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[11:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [11:17:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' . [11:17:31] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [11:17:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [11:17:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [11:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [11:18:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'production' . [11:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[11:18:29] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [11:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:50] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [11:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [11:19:17] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'canary' . [11:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:34] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE [11:19:42] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [11:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:20:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [11:20:06] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . 
[11:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [11:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' . [11:20:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [11:20:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [11:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [11:21:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [11:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:34] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'production' . [11:21:34] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[11:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:49] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1096.eqiad.wmnet with reason: REIMAGE [11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:01] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:07] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:02] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [11:37:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[11:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:01] !log rolling restart of ats-tls on cp500[1-5] [11:38:02] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:51] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [11:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1005.wikimedia.org [11:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:29] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org [11:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:17] 10SRE, 10Mail, 10observability: Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10jbond) p:05Triage→03Medium [11:45:38] (03PS1) 10Alexandros Kosiaris: kafka-jumbo: Add EQIAD_PRIVATE_PRIVATE1_KUBESTAGEPODS_CODFW [puppet] - 10https://gerrit.wikimedia.org/r/667143 [11:47:04] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [11:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:26] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [11:47:26] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[11:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:50] (03PS2) 10Alexandros Kosiaris: kafka-jumbo: Add EQIAD_PRIVATE_PRIVATE1_KUBESTAGEPODS_CODFW [puppet] - 10https://gerrit.wikimedia.org/r/667143 [11:53:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] kafka-jumbo: Add EQIAD_PRIVATE_PRIVATE1_KUBESTAGEPODS_CODFW [puppet] - 10https://gerrit.wikimedia.org/r/667143 (owner: 10Alexandros Kosiaris) [11:54:42] !log delete exim messages in the queue ro root@wikimedia.org older then 7200 seconds and younger the 10800 seconds on mx2001 [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:19] !log delete exim messages in the queue ro root@wikimedia.org older then 7200 seconds and younger the 10800 seconds on mx1001 [11:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [11:58:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [11:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [11:59:25] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[11:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:14] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1004.wikimedia.org [12:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:47] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1003.wikimedia.org [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:30] (03CR) 10Hnowlan: [C: 03+2] osm: add missing production step to import script [puppet] - 10https://gerrit.wikimedia.org/r/666596 (owner: 10Hnowlan) [12:02:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: initial support for Debian Buster on bastions [puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) [12:04:34] (03CR) 10jerkins-bot: [V: 04-1] toolforge: initial support for Debian Buster on bastions [puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) (owner: 10Arturo Borrero Gonzalez) [12:05:25] (03PS2) 10Arturo Borrero Gonzalez: toolforge: initial support for Debian Buster on bastions [puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) [12:05:52] (03CR) 10Kosta Harlan: [C: 03+1] [beta] GrowthExperiments: disable remote API use on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667010 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [12:07:28] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[12:10:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [12:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [12:10:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [12:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:51] (03PS1) 10Marostegui: instances.yaml: Add db2147 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/667145 (https://phabricator.wikimedia.org/T275633) [12:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1003.wikimedia.org [12:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:13] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2147 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/667145 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:12:45] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:38] RECOVERY - exim queue on mx2001 is OK: OK: Less than 2000 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim [12:14:06] (03PS1) 10Alexandros Kosiaris: eventgate-analytics-external: Add port 443 to schema rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/667146 [12:14:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add new vslow,dump host to codfw s4 - T275633', diff saved to https://phabricator.wikimedia.org/P14508 and previous config saved to /var/cache/conftool/dbconfig/20210226-121438-marostegui.json [12:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:46] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [12:16:16] (03PS1) 10Marostegui: db2147: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/667147 (https://phabricator.wikimedia.org/T275633) [12:17:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics-external: Add port 443 to schema rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/667146 (owner: 10Alexandros Kosiaris) [12:17:29] (03CR) 10Marostegui: [C: 03+2] db2147: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/667147 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:17:59] (03Merged) 10jenkins-bot: eventgate-analytics-external: Add port 443 to schema rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/667146 (owner: 10Alexandros Kosiaris) [12:18:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [12:18:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [12:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[12:19:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:09] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod, Kosta's request" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667010 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [12:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:40] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: disable remote API use on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667010 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [12:22:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:22:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [12:22:28] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [12:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:33] (03PS1) 10Elukey: amd_rocm: deploy the hcc package only with version < 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/667151 [12:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:08] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[12:23:08] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [12:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28270/console" [puppet] - 10https://gerrit.wikimedia.org/r/667151 (owner: 10Elukey) [12:25:36] (03CR) 10Elukey: [V: 03+1 C: 03+2] amd_rocm: deploy the hcc package only with version < 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/667151 (owner: 10Elukey) [12:27:43] (03PS3) 10Arturo Borrero Gonzalez: toolforge: initial support for Debian Buster on bastions [puppet] - 10https://gerrit.wikimedia.org/r/667144 (https://phabricator.wikimedia.org/T275865) [12:28:33] (03PS1) 10Hnowlan: osm:imposm: Make imposm updater proxy-aware [puppet] - 10https://gerrit.wikimedia.org/r/667153 (https://phabricator.wikimedia.org/T238753) [12:29:11] (03PS1) 10Elukey: amd_rocm: remove hcc from base packages [puppet] - 10https://gerrit.wikimedia.org/r/667154 [12:29:59] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28271/console" [puppet] - 10https://gerrit.wikimedia.org/r/667153 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:30:54] (03CR) 10Elukey: [C: 03+2] amd_rocm: remove hcc from base packages [puppet] - 10https://gerrit.wikimedia.org/r/667154 (owner: 10Elukey) [12:32:17] (03PS2) 10Hnowlan: prometheus::postgres_exporter: disk metrics and custom queries [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) [12:32:32] (03CR) 10Hnowlan: prometheus::postgres_exporter: disk metrics and custom queries (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) 
(owner: 10Hnowlan) [12:34:32] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28272/console" [puppet] - 10https://gerrit.wikimedia.org/r/667153 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:36:16] (03PS1) 10Alexandros Kosiaris: eventgate-analytics{,-external}: refresh kafka-jumbo list [deployment-charts] - 10https://gerrit.wikimedia.org/r/667155 [12:37:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics{,-external}: refresh kafka-jumbo list [deployment-charts] - 10https://gerrit.wikimedia.org/r/667155 (owner: 10Alexandros Kosiaris) [12:38:12] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28273/console" [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [12:38:27] (03Merged) 10jenkins-bot: eventgate-analytics{,-external}: refresh kafka-jumbo list [deployment-charts] - 10https://gerrit.wikimedia.org/r/667155 (owner: 10Alexandros Kosiaris) [12:40:18] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [12:40:19] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [12:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [12:40:43] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[12:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:40] RECOVERY - exim queue on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [12:47:48] 10SRE, 10Mail, 10observability: Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10jbond) [12:49:07] (03CR) 10MSantos: [C: 03+1] osm:imposm: Make imposm updater proxy-aware [puppet] - 10https://gerrit.wikimedia.org/r/667153 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:56:58] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] osm:imposm: Make imposm updater proxy-aware [puppet] - 10https://gerrit.wikimedia.org/r/667153 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:59:06] !log upgrade memcached on mc1031, mc2031 [12:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:08] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1031.eqiad.wmnet [13:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1031.eqiad.wmnet [13:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:22] (03CR) 10Jbond: [C: 03+2] check_puppetrun: go critical puppet is disabled for more then a week [puppet] - 10https://gerrit.wikimedia.org/r/666950 (owner: 10Jbond) [13:13:13] (03PS1) 10Muehlenhoff: Reduce TTL for irc CNAME to 5 minutes [dns] - 10https://gerrit.wikimedia.org/r/667161 [13:16:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/667112 (https://phabricator.wikimedia.org/T254605) (owner: 10Filippo Giunchedi) [13:17:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/667132 (https://phabricator.wikimedia.org/T275599) (owner: 10Muehlenhoff) 
[13:19:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/667161 (owner: 10Muehlenhoff) [13:20:15] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:20:52] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10jbond) [13:22:13] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10jbond) @OlyKalinichenkoSpeedAndFunction are you also able to confirm L3 status as per: >>! In T275677#6858091, @brennen wrote: > @OlyKalinichenko... [13:22:15] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:25:35] 10SRE: This task tracks the preparation of our base system services for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) [13:26:02] 10SRE: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:28:21] 10SRE, 10Security: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (10jbond) in relation to home I think it makes sense to remove it and for users who want that behaviour by default could simply add `alias sudo='sudo -H'` (would obvioulsy need to be co...
[13:28:24] 10SRE, 10Security: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (10jbond) p:05Triage→03Medium [13:30:08] (03PS1) 10Muehlenhoff: Add bullseye-wikimedia to apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) [13:32:44] (03CR) 10Kormat: Remove labsdb1012 from puppet in preparation for rename (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [13:38:51] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:17] 10SRE, 10Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10jbond) p:05Triage→03Medium [13:44:08] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:46] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:01] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:41] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [13:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:59] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10MSantos) [13:56:56] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [13:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:53] (03PS1) 10Jgiannelos: WIP: Deply
tegola on kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) [14:04:35] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 103758536 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:05:15] (03CR) 10Jgiannelos: [C: 04-1] "Blocking this for now, since its WIP." [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [14:06:51] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 848 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:17:37] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:20:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:05] 10SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (10fgiunchedi) Status update: I've been going through the `object-server` `PUT` logs on the codfw backend to investigate for performance differences related to e.g. filesystem aging. So far nothing stands ou... 
[14:22:52] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:25] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:59] (03PS1) 10David Caro: WIP step_by_step: Added cli option to ask confirmation before each command [software/spicerack] - 10https://gerrit.wikimedia.org/r/667170 [14:30:40] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:57] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:51] (03CR) 10jerkins-bot: [V: 04-1] WIP step_by_step: Added cli option to ask confirmation before each command [software/spicerack] - 10https://gerrit.wikimedia.org/r/667170 (owner: 10David Caro) [14:35:51] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 38985368 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:37:10] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:05] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 329056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:38:19] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:30] (03PS2) 10David Caro: etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 [14:41:32] 
(03PS1) 10David Caro: remote: fix typing confusion [software/spicerack] - 10https://gerrit.wikimedia.org/r/667172 [14:42:42] 10SRE, 10Security: Investigate potential issues with the sudoeres env_keep values - https://phabricator.wikimedia.org/T275852 (10klausman) `TMUX` being visible is, as mentioned, not a security issue when sudo'ing to non-root. The var contains just a path, with its own permissions, and racing attacks with symli... [14:43:35] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:45] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:52] (03CR) 10David Caro: "For some reason it depends on the previous patch, as it seems to trigger the typing issue with the decorated property." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [14:50:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm) [14:50:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm) [14:51:51] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:05] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [14:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:54] (03PS4) 10Ayounsi: Capirca POC [homer/public] - 10https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) [14:59:15] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [14:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/666982 (owner: 10JMeybohm) [15:04:29] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [15:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:57] (03PS3) 10Hnowlan: prometheus::postgres_exporter: disk metrics and custom queries [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) [15:11:37] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28274/console" [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [15:29:21] (03PS1) 10Elukey: install_server: switch to partman's reuse-parts.cfg for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/667180 
(https://phabricator.wikimedia.org/T231067) [15:30:29] (03CR) 10Elukey: [C: 03+2] install_server: switch to partman's reuse-parts.cfg for hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/667180 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [15:36:59] (03CR) 10SBassett: [C: 03+1] "> You dont need to review the puppet code, just the idea that we pull from one source with --delete so there can only be one version of /s" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [15:41:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [15:49:10] 10SRE, 10Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10BBlack) Following up a bit on other paths through this problem: > From there we can probably tune the nuke_limit to where LRU nuke failures are rare enough that we're ok with the tradeoff... 
[15:49:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [15:51:15] (03PS1) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 [15:51:58] (03PS2) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) [15:53:18] (03Abandoned) 10David Caro: DONOTMERGE wmcs: Move to class-based cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/658631 (owner: 10David Caro) [15:53:28] (03Abandoned) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro) [15:55:05] (03PS3) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) [15:58:10] 10SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (10fgiunchedi) Inspecting the client side, all requests that take > 70s are from eqiad jobrunners, it is unclear to me why yet: ` $ zcat ms-fe*/swift.log-20210224.gz | awk '/proxy-server:.*PUT/ && $21 > 80... [15:58:16] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [15:59:10] (03CR) 10David Caro: "Fyi. tests are failing due to missing a new release of spicerack." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:02:18] (03PS2) 10Jgiannelos: WIP: Deploy tegola on kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) [16:05:22] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) @Marostegui understood. We will have to mentioned that on all the next racking task now as a side note so i do not forget. Thanks. [16:09:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667112 (https://phabricator.wikimedia.org/T254605) (owner: 10Filippo Giunchedi) [16:13:08] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) a:05RobH→03wiki_willy I would recommend opening a new task rather than reopening a resolved racking task and adding to the 'racking' timeline for... [16:19:35] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Marostegui) Thank you! [16:26:19] RECOVERY - Host elastic1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [16:26:27] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): elastic1033's mgmt is unreachable - https://phabricator.wikimedia.org/T275733 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson fixed, the cable needed to be replaced.
[16:27:41] 10SRE, 10ops-eqiad, 10Analytics: an-worker1111 PS Redundancy alert - https://phabricator.wikimedia.org/T275732 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Fixed, loose cable [16:46:17] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 219000704 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:48:25] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 394784 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:51:41] RECOVERY - IPMI Sensor Status on an-worker1111 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:52:35] RECOVERY - Device not healthy -SMART- on an-worker1097 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops [16:57:41] 10SRE, 10SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (10jbond) @dr0ptp4kt AFAIK what you are asking for is only possible if we where to share the the super user password with you. This would give you admin ac... [17:02:12] 10SRE: Mediawiki Swift PUTs from eqiad to codfw reported slow - https://phabricator.wikimedia.org/T275752 (10fgiunchedi) Looking back a few days, e.g. Feb 4-5th, the list of hosts that take > 80s is still eqiad jobrunners, and suspiciously all have been running buster by that date: ` $ zcat ms-fe2*/swift.log-20...
[17:04:28] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10Cmjohnson) 05Open→03Resolved Replaced the disk [17:07:20] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Cmjohnson) update: this was scheduled for today but when I sent the tech the access ticket I was told it's been re-assigned and someone should've contacted me. That did not happen. I need to figure it out and this will... [17:27:05] (03PS1) 10Papaul: DHCP: Add MAC address for mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667212 (https://phabricator.wikimedia.org/T274170) [17:27:44] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address for mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667212 (https://phabricator.wikimedia.org/T274170) (owner: 10Papaul) [17:28:06] 10SRE, 10GitLab, 10vm-requests, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10debt) [17:28:14] 10SRE, 10GitLab, 10SRE-Access-Requests, 10User-brennen: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (10debt) [17:30:03] (03PS2) 10Papaul: DHCP: Add MAC address for mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667212 (https://phabricator.wikimedia.org/T274170) [17:31:30] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667212 (https://phabricator.wikimedia.org/T274170) (owner: 10Papaul) [17:37:30] (03PS1) 10Papaul: Add mwmaint2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/667213 (https://phabricator.wikimedia.org/T274170) [17:38:46] (03CR) 10Papaul: [C: 03+2] Add mwmaint2002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/667213 (https://phabricator.wikimedia.org/T274170) (owner: 10Papaul) [17:40:13] (03PS1) 10David Caro: wmcs.vps: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214
(https://phabricator.wikimedia.org/T274497) [17:40:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mwmaint2002.codfw.wmnet ` The log can be found in `/var/log... [17:43:17] (03CR) 10jerkins-bot: [V: 04-1] wmcs.vps: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [17:49:33] (03PS1) 10Jbond: idp: add netbox-next domain to authorised list for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667217 [17:51:14] (03PS2) 10Jbond: idp: add netbox-next domain to authorised list for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667217 [17:52:14] (03PS3) 10Jbond: idp: add netbox-next domain to authorised list for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667217 [17:53:04] (03CR) 10CRusnov: [C: 03+1] "lgtm thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/667217 (owner: 10Jbond) [17:53:16] (03CR) 10Jbond: [C: 03+2] idp: add netbox-next domain to authorised list for netbox [puppet] - 10https://gerrit.wikimedia.org/r/667217 (owner: 10Jbond) [17:55:11] (03PS1) 10Sbisson: Remove unused config for InukaPageView [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667219 (https://phabricator.wikimedia.org/T265921) [17:57:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:57:28] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2002.codfw.wmnet with reason: REIMAGE [17:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is 
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:59:23] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mwmaint2002.codfw.wmnet with reason: REIMAGE [17:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:31] (03CR) 10Ottomata: "Ah so this is what will enable us to manage the policy rules in the helmfiles?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667015 (owner: 10JMeybohm) [18:01:30] (03CR) 10Dduvall: "This should be ready for review now. The latest image published from this change is docker-registry.wikimedia.org/wikimedia/mediawiki-mult" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [18:07:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mwmaint2002.codfw.wmnet'] ` and were **ALL** successful. [18:10:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Papaul) [18:11:53] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Papaul) 05Open→03Resolved This is complete . [18:22:00] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) No worries @elukey, it looks like I missed the double count in rack A4 as well. If these hosts need to stay in row A though, the only other 10...
[18:35:53] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (10Legoktm) Thanks! [18:37:55] 10SRE, 10SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (10dr0ptp4kt) Thanks @jbond. IIRC the Webmaster Central URL at https://www.google.com/webmasters/verification/details?hl=en&domain=wikidata.org ought to ma... [18:44:55] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Marostegui) Thanks for the update. Much appreciated! [18:59:29] (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [19:02:50] (03CR) 10Dzahn: "The situation before this is that you can run the "scap-sync-master" command on any deployment server and manually specify a source and it" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [19:08:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:08:40] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) T274170 introduced new hardware mwmaint2002 and can be used now. 
timing :p [19:10:04] 10SRE, 10serviceops: move mwmaint2002 into production - https://phabricator.wikimedia.org/T275905 (10Dzahn) [19:12:13] 10SRE, 10serviceops: move mwmaint2002 into production - https://phabricator.wikimedia.org/T275905 (10Dzahn) [19:19:08] 10SRE, 10serviceops: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (10Dzahn) [19:32:02] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10KFrancis) @jbond Hello, would you please confirm if Sergey Trofimovsky us an employee or contractor for Speed & Function? Would you please also... [19:33:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:12] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10KFrancis) @jbond Hello, would you please confirm if Eugene Chernov us an employee or contractor for Speed & Function? Would you please also let me... [19:33:53] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10KFrancis) @jbond @jbond Hello, would you please confirm if Oly Kalinichenko us an employee or contractor for Speed & Function? Would you please a... [19:35:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:36:36] (03PS1) 10Dzahn: tcpircbot: add mwmaint2002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/667238 (https://phabricator.wikimedia.org/T275905) [19:39:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:42:13] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:34] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10jbond) @KFrancis they are not staff AFAIK the are contractors for Speed & Function. At a high level gitlab1001 / gitlab1002 are servers which w... [19:45:59] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10jbond) @KFrancis they are not staff AFAIK the are contractors for Speed & Function. At a high level gitlab1001 / gitlab1002 are servers which will... [19:46:17] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10jbond) @KFrancis they are not staff AFAIK the are contractors for Speed & Function. At a high level gitlab1001 / gitlab1002 are servers which wil... 
[19:51:01] (03PS1) 10Dzahn: mariadb: add mwmaint2002 to production-m5 SQL grants [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) [19:53:53] (03PS1) 10Dzahn: mediawiki::maintenance: sync home dir from mwmaint2001 to mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/667241 (https://phabricator.wikimedia.org/T275905) [19:54:05] (03PS1) 10Ahmon Dancy: Extend wmfSwiftConfig placeholder keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667242 [19:54:34] (03PS1) 10Ahmon Dancy: env.php: Allow the datacenter/realm to be specified in MW_REALM environment variable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 [19:54:55] (03PS1) 10Ahmon Dancy: wmf-config/CommonSettings.php: Add MW_NO_ETCD handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) [19:55:48] (03CR) 10jerkins-bot: [V: 04-1] env.php: Allow the datacenter/realm to be specified in MW_REALM environment variable. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy) [19:56:27] (03CR) 10jerkins-bot: [V: 04-1] wmf-config/CommonSettings.php: Add MW_NO_ETCD handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [19:57:30] (03PS2) 10Ahmon Dancy: env.php: Allow the datacenter/realm to be specified in MW_REALM environment variable. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 [19:58:45] (03PS1) 10Dzahn: add mwmaint2002 to maintenance hosts list for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/667245 (https://phabricator.wikimedia.org/T275905) [19:58:53] (03CR) 1020after4: [C: 03+1] deployment::rsync:: also sync patches directory [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [20:00:55] (03PS2) 10Ahmon Dancy: wmf-config/CommonSettings.php: Add MW_NO_ETCD handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) [20:02:35] (03CR) 10jerkins-bot: [V: 04-1] wmf-config/CommonSettings.php: Add MW_NO_ETCD handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [20:09:13] (03PS1) 10Dzahn: scap: add mwmaint2002 to dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/667267 (https://phabricator.wikimedia.org/T275905) [20:12:56] (03PS3) 10Ahmon Dancy: wmf-config/CommonSettings.php: Add MW_NO_ETCD handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) [20:19:43] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:03] (03CR) 10Cwhite: [C: 03+1] rsyslog: give 'ops' group access to centrallog files [puppet] - 10https://gerrit.wikimedia.org/r/667112 (https://phabricator.wikimedia.org/T254605) (owner: 10Filippo Giunchedi) [20:29:48] (03CR) 10Cwhite: [C: 03+1] prometheus::postgres_exporter: disk metrics and custom queries [puppet] - 10https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: 10Hnowlan) [20:29:48] !log deploy2001 - /srv/mediawiki-staging sudo find . 
-name *.cdb delete - deleted 190 GB of old cdb files (T275826 T265963) [20:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:56] T275826: L10n cache files building up on backup deploy hosts - https://phabricator.wikimedia.org/T275826 [20:29:57] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [20:33:05] (03PS2) 10Dzahn: deployment::rsync:: also sync patches directory [puppet] - 10https://gerrit.wikimedia.org/r/667031 [20:34:26] (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [20:35:20] (03CR) 10Dzahn: [C: 03+2] deployment::rsync:: also sync patches directory [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [20:39:48] (03CR) 10Dzahn: "confirmed working on all 3 non-active servers. they all pull it from deploy1001 and the new timer works" [puppet] - 10https://gerrit.wikimedia.org/r/667031 (owner: 10Dzahn) [21:05:35] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:07:57] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:43:50] (03PS1) 10Ottomata: Include profile::analytics::jupyterhub on an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/667276 (https://phabricator.wikimedia.org/T262847) [21:44:18] (03CR) 10jerkins-bot: [V: 04-1] Include profile::analytics::jupyterhub on an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/667276 (https://phabricator.wikimedia.org/T262847) (owner: 10Ottomata) [21:46:21] (03CR) 1020after4: [C: 03+1] role::deployment_server: re-order includes, add comments, clean up [puppet] - 
10https://gerrit.wikimedia.org/r/667018 (owner: 10Dzahn) [21:46:40] (03PS2) 10Ottomata: Include profile::analytics::jupyterhub on an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/667276 (https://phabricator.wikimedia.org/T262847) [21:46:59] (03CR) 10Dzahn: [C: 03+2] role::deployment_server: re-order includes, add comments, clean up [puppet] - 10https://gerrit.wikimedia.org/r/667018 (owner: 10Dzahn) [21:49:48] (03CR) 10Dzahn: "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/667018 (owner: 10Dzahn) [21:53:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [21:57:11] (03PS1) 10Dzahn: deployment_server: add the php restart commands here as well [puppet] - 10https://gerrit.wikimedia.org/r/667277 (https://phabricator.wikimedia.org/T265963) [21:58:22] jinxer-wm: would be nice if it contained the hostname, after clicking I see it's a management switch, but I think that's "normal" [21:58:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [22:00:50] the email to noc@ has the hostname and details [22:01:06] (03PS2) 10Dzahn: deployment_server: add the php restart commands here as well [puppet] - 10https://gerrit.wikimedia.org/r/667277 (https://phabricator.wikimedia.org/T265963) [22:01:11] oh, we are getting icinga emails again now [22:01:47] still wondering how it actually stopped back then [22:03:32] (03CR) 10Dzahn: [C: 04-1] "sigh.. 
Function lookup() did not find a value for the name 'has_lvs'" [puppet] - 10https://gerrit.wikimedia.org/r/667277 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [22:12:14] (03CR) 10Dzahn: [C: 03+1] Reduce TTL for irc CNAME to 5 minutes [dns] - 10https://gerrit.wikimedia.org/r/667161 (owner: 10Muehlenhoff) [22:13:19] (03PS1) 10Dzahn: deployment: allow syncing home dirs to other dpeloyment servers [puppet] - 10https://gerrit.wikimedia.org/r/667278 (https://phabricator.wikimedia.org/T265963) [22:14:46] (03CR) 10Dzahn: "When I ran puppet I saw it remove the bind password, it became blank. Then additionally I saw it remove an LVS IP. I reverted.. that chang" [puppet] - 10https://gerrit.wikimedia.org/r/666989 (owner: 10Dzahn) [22:19:33] 10SRE, 10serviceops, 10Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (10Dzahn) [22:19:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Dzahn) [22:21:36] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28279/" [puppet] - 10https://gerrit.wikimedia.org/r/667278 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [22:22:50] (03PS1) 10Dzahn: DHCP: remove mwmaint2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/667281 (https://phabricator.wikimedia.org/T275928) [22:25:43] (03PS1) 10Dzahn: remove mwmaint2001 from maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 [22:26:10] (03CR) 10jerkins-bot: [V: 04-1] remove mwmaint2001 from maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 (owner: 10Dzahn) [22:44:56] (03PS2) 10Dzahn: remove mwmaint2001 from maintenance hosts and scap groups [puppet] - 10https://gerrit.wikimedia.org/r/667282 [22:47:46] (03PS1) 10Dzahn: mariadb: remove mwmaint2001 from production-m5 grants [puppet] - 
10https://gerrit.wikimedia.org/r/667288 (https://phabricator.wikimedia.org/T275928) [22:55:16] fdans: noc@ goes to all of SRE [22:57:16] (03PS1) 10Dzahn: site: add mwmaint2002.codfw.wmnet to maintenance server role [puppet] - 10https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) [22:57:18] (03PS1) 10Dzahn: site: remove mwmaint2001.codfw.mwnet [puppet] - 10https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) [22:57:59] (03CR) 10jerkins-bot: [V: 04-1] site: add mwmaint2002.codfw.wmnet to maintenance server role [puppet] - 10https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [22:58:06] (03CR) 10jerkins-bot: [V: 04-1] site: remove mwmaint2001.codfw.mwnet [puppet] - 10https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) (owner: 10Dzahn) [22:59:16] legoktm yes, I can see that [22:59:19] thank you [22:59:23] that's embarrassing [23:00:54] no worries, it's not really documented anywhere [23:01:08] I added it to https://wikitech.wikimedia.org/w/index.php?title=Noc.wikimedia.org&type=revision&diff=1900758&oldid=1847258 for now [23:05:17] legoktm: i appreciate that [23:06:10] fdans: dont worry, we get soo much mail on root@ that this is merely a drop in the ocean [23:06:25] but yea, it is delivered [23:06:52] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [23:06:54] it took Thunderbird 20 minutes to download the 8000 emails in my inbox from this morning [23:09:47] o_O 8,000 emails [23:11:07] that was an abberation :p normally it's like 10-20 [23:11:53] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) [23:15:18] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) [23:15:44] now you made me 
find out how many emails I actually have [23:15:52] 277,416 read [23:16:18] last time i tried IMAP it would not really work anymore [23:16:28] im sitting at a niec 37,978 unread emails xD [23:16:32] nice* [23:17:18] Zppix: lol, i just don't like email and everything that comes with it :) [23:17:32] I'm to lazy to read mine [23:17:53] I like postal mail, but all I get are bank statements and utility bills in my mailbox [23:18:09] mark all as read. done. [23:18:16] thats how I work :P [23:18:30] Oh, they are; when the letter arrives they charged me already [23:18:58] looking at spam older first time since a long time.. omg.. many "interesting" scam attempts [23:19:16] like pretending it's "cpanel@wikipedia", nice try [23:19:32] "clean folder" [23:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:24:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:32:26] (03CR) 10Legoktm: [C: 03+1] Extend wmfSwiftConfig placeholder keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667242 (owner: 10Ahmon Dancy) [23:39:40] (03CR) 10Legoktm: [C: 04-1] "As for naming, either MW_NO_ETCD or MW_INITIALIZING_L10N seems fine...the first is describing what you want, and the second is what you're" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [23:41:50] (03CR) 10Legoktm: "What are you planning to set MW_REALM to? IIRC we had discussed creating a "dev" or "local" realm if it isn't labs or prod." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667243 (owner: 10Ahmon Dancy)