[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:00:54] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2027 site=codfw tunnel=mc1027_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [00:02:17] (03PS1) 10Bstorm: shared-storage: enable project NFS for wikipathways [puppet] - 10https://gerrit.wikimedia.org/r/668234 (https://phabricator.wikimedia.org/T276141) [00:02:20] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@e47f735]: search_satisfaction_daily: make files readable by druid ingestion (duration: 25m 35s) [00:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:22] PROBLEM - Ensure local MW versions match expected deployment on deploy1001 is CRITICAL: CRITICAL: 523 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [00:41:44] PROBLEM - mediawiki-installation DSH group on deploy1001 is CRITICAL: Host deploy1001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:00:04] twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T0100). 
[01:01:02] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 79 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:21:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:22:08] !log restarting php7.3-fpm on phab1001 to complete phabricator upgrade [01:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:23:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:23:45] (03CR) 10Wolfgang Kandek: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668186 (https://phabricator.wikimedia.org/T275722) (owner: 10Jbond) [01:24:09] !log phabricator upgrade complete [01:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:03:02] (03PS1) 10Ladsgroup: Use the new mediawiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) [02:04:46] (03CR) 10Ladsgroup: [C: 04-2] "Trademark has not been filed yet. Do not merge unless it's cleared by legal." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [03:11:05] (03PS2) 10Ladsgroup: Use the new mediawiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668241 (https://phabricator.wikimedia.org/T268230) [03:25:24] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.9958 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [03:47:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:49:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:34:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:05:30] * kart_ updating apertium [05:05:39] (03CR) 10KartikMistry: [C: 03+2] Update apertium to 2021-03-03-170806-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/668134 (https://phabricator.wikimedia.org/T274262) (owner: 10KartikMistry) [05:06:19] (03Merged) 10jenkins-bot: Update apertium to 2021-03-03-170806-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/668134 (https://phabricator.wikimedia.org/T274262) (owner: 10KartikMistry) [05:10:16] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . 
[05:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:41] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [05:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:25] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . [05:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:38] !log Updated apertium to 2021-03-03-170806-production (T274262) [05:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:45] T274262: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 [05:57:54] (03CR) 10Marostegui: [C: 03+1] "let me know when I can start transferring data" [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [06:11:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2116 T275633', diff saved to https://phabricator.wikimedia.org/P14621 and previous config saved to /var/cache/conftool/dbconfig/20210304-061134-marostegui.json [06:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:43] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:11:47] !log Stop MySQL on db2116 to clone db2145 T275633 [06:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:51] (03PS1) 10Marostegui: mariadb: Upgrade package to 10.4.18 [software] - 10https://gerrit.wikimedia.org/r/668252 [06:17:51] (03CR) 10Marostegui: [C: 03+2] mariadb: Upgrade package to 10.4.18 [software] - 10https://gerrit.wikimedia.org/r/668252 (owner: 10Marostegui) [06:18:23] (03Merged) 10jenkins-bot: mariadb: Upgrade package to 10.4.18 [software] - 10https://gerrit.wikimedia.org/r/668252 (owner: 10Marostegui) [06:25:03] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 T276025', diff saved to https://phabricator.wikimedia.org/P14622 and previous config saved to /var/cache/conftool/dbconfig/20210304-062503-marostegui.json [06:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:10] T276025: decommission db1088.eqiad.wmnet - https://phabricator.wikimedia.org/T276025 [06:26:07] (03PS1) 10Marostegui: db1088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/668253 (https://phabricator.wikimedia.org/T276025) [06:27:25] (03CR) 10Marostegui: [C: 03+2] db1088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/668253 (https://phabricator.wikimedia.org/T276025) (owner: 10Marostegui) [06:47:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:49:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:52:56] (03PS1) 10Ryan Kemper: wdqs: new service alias query-preview [dns] - 10https://gerrit.wikimedia.org/r/668255 (https://phabricator.wikimedia.org/T266470) [06:58:53] 10ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (10Legoktm) [07:00:28] ACKNOWLEDGEMENT - SSH on mc1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm https://phabricator.wikimedia.org/T276415 https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:00:28] ACKNOWLEDGEMENT - Memcached on mc1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Legoktm https://phabricator.wikimedia.org/T276415 https://wikitech.wikimedia.org/wiki/Memcached [07:00:28] ACKNOWLEDGEMENT - Host mc1027 is DOWN: PING CRITICAL - Packet loss = 100% Legoktm https://phabricator.wikimedia.org/T276415 [07:01:03] (03CR) 10Ryan Kemper: "Hey Brandon! 
Quick refresher: you, gehel and I discussed on IRC a month ago about how to expose a new `query-preview.wikidata.org` which w" [dns] - 10https://gerrit.wikimedia.org/r/668255 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [07:04:02] (03PS3) 10Ryan Kemper: wdqs: expose wdqs1009 externally [puppet] - 10https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) [07:07:25] (03PS2) 10Elukey: role::analytics_cluster::hadoop::worker: set linux 5.10 on GPU workers [puppet] - 10https://gerrit.wikimedia.org/r/668106 (https://phabricator.wikimedia.org/T231067) [07:09:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28362/console" [puppet] - 10https://gerrit.wikimedia.org/r/668106 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [07:11:40] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::analytics_cluster::hadoop::worker: set linux 5.10 on GPU workers [puppet] - 10https://gerrit.wikimedia.org/r/668106 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [07:11:50] (03CR) 10Ryan Kemper: "Hey Brandon! 
Quick refresher: you, gehel and I discussed on IRC a month ago about how to expose a new `query-preview.wikidata.org` which w" [puppet] - 10https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [07:26:29] (03PS1) 10Giuseppe Lavagetto: redis: create new shard on mc1035 to sub in for mc1027 [puppet] - 10https://gerrit.wikimedia.org/r/668258 (https://phabricator.wikimedia.org/T272319) [07:29:04] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28363/console" [puppet] - 10https://gerrit.wikimedia.org/r/668258 (https://phabricator.wikimedia.org/T272319) (owner: 10Giuseppe Lavagetto) [07:35:28] (03CR) 10Elukey: [C: 03+1] "port:ip look good" [puppet] - 10https://gerrit.wikimedia.org/r/668258 (https://phabricator.wikimedia.org/T272319) (owner: 10Giuseppe Lavagetto) [07:38:02] !log reboot an-worker1096 to pick up 5.10 kernel [07:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:24] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:43:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] redis: create new shard on mc1035 to sub in for mc1027 [puppet] - 10https://gerrit.wikimedia.org/r/668258 (https://phabricator.wikimedia.org/T272319) (owner: 10Giuseppe Lavagetto) [07:47:14] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [07:48:46] 10SRE: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10MoritzMuehlenhoff) [07:48:53] 10SRE: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I'll take care of this once the new server has arrived. 
[07:51:09] (03PS1) 10DCausse: [wdqs] fix puppet on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/668336 [07:52:24] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] fix puppet on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/668336 (owner: 10DCausse) [07:52:32] (03CR) 10Muehlenhoff: [C: 03+2] striker: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668009 (owner: 10Muehlenhoff) [07:54:02] (03CR) 10Kosta Harlan: [C: 04-2] linkrecommendation: Use Envoy for requests to MediaWiki API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [07:54:35] (03PS2) 10DCausse: [wdqs] fix puppet on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/668336 [08:00:31] (03PS3) 10Gehel: [wdqs] fix puppet on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/668336 (owner: 10DCausse) [08:01:50] (03CR) 10Gehel: [C: 03+2] [wdqs] fix puppet on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/668336 (owner: 10DCausse) [08:04:32] (03PS1) 10Elukey: role::analytics_cluster::hadoop::worker: add gpu-users [puppet] - 10https://gerrit.wikimedia.org/r/668337 (https://phabricator.wikimedia.org/T231067) [08:09:18] (03CR) 10Muehlenhoff: [C: 03+2] service::node: Remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/666930 (owner: 10Muehlenhoff) [08:10:32] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10Majavah) [08:12:37] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10Legoktm) [08:12:58] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10Legoktm) [08:13:00] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10Legoktm) [08:13:54] (03PS2) 10Muehlenhoff: cassandra: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668027 [08:14:31] (03CR) 10jerkins-bot: [V: 04-1] cassandra: Remove 
support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668027 (owner: 10Muehlenhoff) [08:19:57] (03CR) 10JMeybohm: linkrecommendation: Use Envoy for requests to MediaWiki API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [08:20:47] (03PS3) 10Muehlenhoff: cassandra: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668027 [08:23:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668337 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [08:23:52] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop::worker: add gpu-users [puppet] - 10https://gerrit.wikimedia.org/r/668337 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [08:24:05] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Remove conflicting gadget configuration for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668108 (https://phabricator.wikimedia.org/T276330) (owner: 10WMDE-Fisch) [08:27:00] (03CR) 10Elukey: [C: 03+1] "LGTM, let's run pcc on some nodes to be extra sure?" [puppet] - 10https://gerrit.wikimedia.org/r/668027 (owner: 10Muehlenhoff) [08:30:13] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10Legoktm) The Debian packaging needs a refresh, and it'll need a few code tweaks to compile on Buster: ` diff --git a/srcmisc/packet-loss.cpp b/srcmisc/packet-loss.cpp index 01e13a2..861cfdc 100644 --- a/srcmisc/packet-loss.cpp +++ b... 
[08:33:59] !log elukey@deploy1002 Started deploy [analytics/refinery@605f8b8]: Fix for geoeditors monthly job [08:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) (owner: 10Cwhite) [08:37:37] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10Legoktm) Also the init script should be rewritten as a systemd service. [08:37:58] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28365/" [puppet] - 10https://gerrit.wikimedia.org/r/668027 (owner: 10Muehlenhoff) [08:41:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:43:06] (03CR) 10Muehlenhoff: [C: 03+2] service::monitoring: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668010 (owner: 10Muehlenhoff) [08:45:03] !log elukey@deploy1002 Finished deploy [analytics/refinery@605f8b8]: Fix for geoeditors monthly job (duration: 11m 03s) [08:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:51:12] 10SRE, 10observability, 10serviceops, 10Parsoid (Tracking), 10User-jijiki: Create per cluster error rate alerts on Mediawiki servers - https://phabricator.wikimedia.org/T262078 (10jijiki) 05Open→03Resolved a:03jijiki [09:02:30] (03CR) 10David Caro: [C: 03+2] toolforge.etcdctl: Added removal of a member (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:06:23] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: fix systemd timer name [puppet] - 10https://gerrit.wikimedia.org/r/668341 [09:07:30] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: Added removal of a member [software/spicerack] - 10https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:09:54] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28367/console" [puppet] - 10https://gerrit.wikimedia.org/r/668341 (owner: 10Giuseppe Lavagetto) [09:13:50] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28368/console" [puppet] - 10https://gerrit.wikimedia.org/r/668341 (owner: 10Giuseppe Lavagetto) [09:15:29] (03CR) 10Elukey: [C: 03+1] docker::baseimages: fix systemd timer name [puppet] - 10https://gerrit.wikimedia.org/r/668341 (owner: 10Giuseppe Lavagetto) [09:18:26] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10MoritzMuehlenhoff) Ah, I hadn't see Kunal's update on the task until now, but my code changes to fix with current Boost are practically the same (sans the that is still uses an init script), I've just uploaded 1.8.5+deb10u1 to buster... 
[09:18:56] <_joe_> elukey: worst-case scenario we get a bunch of puppet failures
[09:19:08] so, a normal day
[09:19:22] !log uploaded udplog 1.8.5+deb10u1 to buster-wikimedia
[09:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:35] (PS1) Volans: DO NOT MERGE - debugging git clone in CI [dns] - https://gerrit.wikimedia.org/r/668345
[09:20:00] (PS1) Jcrespo: install_Server: Apply custom/backup-format.cfg to backup[12]003 [puppet] - https://gerrit.wikimedia.org/r/668346 (https://phabricator.wikimedia.org/T274185)
[09:20:12] (PS2) Jcrespo: install_Server: Apply custom/backup-format.cfg to backup[12]003 [puppet] - https://gerrit.wikimedia.org/r/668346 (https://phabricator.wikimedia.org/T274185)
[09:20:21] <_joe_> kormat: now that you're around, certainly
[09:20:26] (CR) Giuseppe Lavagetto: [V: +1 C: +2] docker::baseimages: fix systemd timer name [puppet] - https://gerrit.wikimedia.org/r/668341 (owner: Giuseppe Lavagetto)
[09:20:31] :D
[09:22:08] (CR) Jcrespo: [C: +2] install_Server: Apply custom/backup-format.cfg to backup[12]003 [puppet] - https://gerrit.wikimedia.org/r/668346 (https://phabricator.wikimedia.org/T274185) (owner: Jcrespo)
[09:24:02] (PS3) David Caro: toolforge.etcdctl: Added removal of a member [software/spicerack] - https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497)
[09:25:05] SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (JMeybohm) p:Triage→Medium
[09:27:43] (CR) Volans: "recheck" [dns] - https://gerrit.wikimedia.org/r/668345 (owner: Volans)
[09:27:50] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wm...
[09:30:01] !log disabling puppet on all db hosts while deploying a puppet monitoring change T275497 [09:30:06] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: Added removal of a member [software/spicerack] - 10https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:08] T275497: mariadb: Replication lag monitoring does not support circular replication - https://phabricator.wikimedia.org/T275497 [09:30:20] (03CR) 10Kormat: [C: 03+2] mariadb: Add section parameters [puppet] - 10https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [09:30:29] (03CR) 10Volans: [C: 03+1] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [09:32:18] !log install linux 5.10 on an-worker[1097-1101] (GPU workers) and reboot them [09:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:49] 10SRE: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10hashar) [09:40:32] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2003.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2003.codfw.wmnet'] ` [09:43:23] PROBLEM - Host an-worker1098 is DOWN: PING CRITICAL - Packet loss = 100% [09:44:00] elukey: ^^^ [09:46:07] RECOVERY - Host an-worker1098 is UP: PING WARNING - Packet loss = 71%, RTA = 0.24 ms [09:46:40] volans: yep I am rebooting them, downtime expired (see above) [09:47:34] sorry missed that one [09:48:13] volans: nono thanks for the ping <3 [09:48:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:53:34] (03PS1) 10Jcrespo: dhcp: Removing extra space after MAC address to discard boot issues [puppet] - 10https://gerrit.wikimedia.org/r/668352 (https://phabricator.wikimedia.org/T274185) [09:54:24] (03CR) 10Jcrespo: [C: 03+2] dhcp: Removing extra space after MAC address to discard boot issues [puppet] - 10https://gerrit.wikimedia.org/r/668352 (https://phabricator.wikimedia.org/T274185) (owner: 10Jcrespo) [09:59:54] 10SRE, 10ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (10jijiki) p:05Triage→03Unbreak! [10:00:32] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wm... [10:02:09] 10SRE, 10ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (10jijiki) @Jclark-ctr or @Cmjohnson please take a look if it is possible to power up this machine (we can coordinate on irc too), If the server is resting in peace, we can close this task since re... 
[10:04:51] (03PS2) 10David Caro: remote: fix typing confusion [software/spicerack] - 10https://gerrit.wikimedia.org/r/667172 [10:07:04] (03PS1) 10Jbond: admin: and end date for speed and function contractors [puppet] - 10https://gerrit.wikimedia.org/r/668354 (https://phabricator.wikimedia.org/T275679) [10:07:23] (03CR) 10Muehlenhoff: [C: 03+2] cassandra: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668027 (owner: 10Muehlenhoff) [10:08:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668354 (https://phabricator.wikimedia.org/T275679) (owner: 10Jbond) [10:09:53] (03CR) 10Jbond: [C: 03+2] admin: and end date for speed and function contractors [puppet] - 10https://gerrit.wikimedia.org/r/668354 (https://phabricator.wikimedia.org/T275679) (owner: 10Jbond) [10:10:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) a:05jcrespo→03Papaul Hey, please help me, The server doesn't boot with PXE: ` Booting from BRCM MBA Slot 0400 v21.6.0 Broadcom UNDI PXE-2.1 v21.6.0 C... 
[10:12:03] (03PS1) 10JMeybohm: Add mwilliams to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) [10:12:18] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10JMeybohm) [10:12:30] (03CR) 10jerkins-bot: [V: 04-1] Add mwilliams to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) (owner: 10JMeybohm) [10:13:08] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2003.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2003.codfw.wmnet'] ` [10:17:33] PROBLEM - Host an-worker1101 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:13] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10JMeybohm) 05Open→03Resolved a:03JMeybohm [10:20:19] RECOVERY - Host an-worker1101 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [10:21:44] (03PS2) 10Volans: DO NOT MERGE - debugging git clone in CI [dns] - 10https://gerrit.wikimedia.org/r/668345 [10:21:52] (03CR) 10David Caro: [C: 03+2] remote: fix typing confusion [software/spicerack] - 10https://gerrit.wikimedia.org/r/667172 (owner: 10David Caro) [10:22:10] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE - debugging git clone in CI [dns] - 10https://gerrit.wikimedia.org/r/668345 (owner: 10Volans) [10:23:16] (03PS3) 10Volans: DO NOT MERGE - debugging git clone in CI [dns] - 10https://gerrit.wikimedia.org/r/668345 [10:26:00] (03CR) 10Jbond: doc: script to restart php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666309 
(https://phabricator.wikimedia.org/T275468) (owner: 10Hashar) [10:28:29] (03Merged) 10jenkins-bot: toolforge.etcdctl: Added removal of a member [software/spicerack] - 10https://gerrit.wikimedia.org/r/666919 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [10:28:31] (03Merged) 10jenkins-bot: remote: fix typing confusion [software/spicerack] - 10https://gerrit.wikimedia.org/r/667172 (owner: 10David Caro) [10:28:52] (03PS2) 10JMeybohm: Add mwilliams to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) [10:31:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) (owner: 10JMeybohm) [10:32:24] !log uploaded screen 4.2.1-3+deb8u1+wmf1 to jessie-wikimedia [10:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:22] (03CR) 10David Caro: [C: 03+2] etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [10:34:31] (03CR) 10jerkins-bot: [V: 04-1] etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [10:34:45] (03CR) 10Jbond: [C: 03+1] "this will need authorisations from analytics" [puppet] - 10https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) (owner: 10JMeybohm) [10:38:05] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10JMeybohm) Add @Ottomata for analytics approval [10:39:50] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/668184 (https://phabricator.wikimedia.org/T201491) (owner: 10Sahilgrewalhere) [10:40:21] !log drain + reimage analytics1059/1060 to Debian Buster [10:40:25] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:30] (03PS3) 10David Caro: etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 [10:57:09] (03PS1) 10JMeybohm: admin: Add aex shell account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) [10:57:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10JMeybohm) [10:58:55] (03CR) 10David Caro: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [10:59:12] (03CR) 10David Caro: etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [10:59:16] (03CR) 10David Caro: [C: 03+2] etcdctl: Fix commands sent to control node [software/spicerack] - 10https://gerrit.wikimedia.org/r/666961 (owner: 10David Caro) [10:59:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10JMeybohm) phab account @OlyKalinichenkoSpeedAndFunction has signed L3 as of signature list, so I checked the box [11:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1100). 
[11:02:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1060.eqiad.wmnet with reason: REIMAGE
[11:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:12] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1022.eqiad.wmnet
[11:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1060.eqiad.wmnet with reason: REIMAGE
[11:04:16] (Merged) jenkins-bot: etcdctl: Fix commands sent to control node [software/spicerack] - https://gerrit.wikimedia.org/r/666961 (owner: David Caro)
[11:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:49] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1022.eqiad.wmnet
[11:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:01] (PS1) Marostegui: mariadb: Productionize db2145 [puppet] - https://gerrit.wikimedia.org/r/668362 (https://phabricator.wikimedia.org/T275633)
[11:09:13] PROBLEM - MariaDB read only x2 #page on db2142 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.15-MariaDB-log, Uptime 851917s, event_scheduler: True, 16.58 QPS, connection latency: 0.002486s, query latency: 0.000540s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:09:16] SRE, SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (JMeybohm)
[11:09:35] crap, that's me.
[11:09:40] \o/ [11:09:59] kormat: I will ack [11:10:04] * volans here [11:10:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Needs fixing after T274472 [11:10:22] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Needs fixing after T274472 [11:10:24] harmless or is anything needed? [11:10:27] marostegui: thanks [11:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:28] T274472: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 [11:10:29] nothing needed [11:10:29] moritzm: harmless [11:10:32] thanks guys [11:10:33] ack [11:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:36] except for damage to my blood pressure [11:10:37] kormat: this is what happens when you say that you'll do rollout like me [11:10:46] elukey: 🥀 [11:10:51] :D [11:11:02] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10JMeybohm) User is listed in namely, so corp LDAP is wrong.
Waiting on sign off from @MarkTraceur [11:11:07] marostegui: i have downtime'd all of x2 for now [11:11:08] (in the sense that I usually do worse than that :D) [11:11:18] kormat: thanks, let me know if you need any help [11:11:21] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2145 [puppet] - 10https://gerrit.wikimedia.org/r/668362 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [11:11:22] <3 [11:11:37] <_joe_> I got the page now [11:11:39] <_joe_> nice [11:11:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1059.eqiad.wmnet with reason: REIMAGE [11:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:31] RECOVERY - MariaDB read only x2 #page on db2142 is OK: Version 10.4.15-MariaDB-log, Uptime 852177s, read_only: False, event_scheduler: True, 16.60 QPS, connection latency: 0.002481s, query latency: 0.000460s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:14:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1059.eqiad.wmnet with reason: REIMAGE [11:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:35] (03PS2) 10Majavah: betacluster: Switch udp2log to deployment-mwlog01 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668338 (https://phabricator.wikimedia.org/T276419) [11:18:22] (03CR) 10Jakob: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/668364 (https://phabricator.wikimedia.org/T268640) (owner: 10Jakob) [11:20:09] (03PS2) 10Kormat: mariadb: Add section parameters: core::multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/668031 [11:23:09] (03PS1) 10Marostegui: instances.yaml: Add db2145 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/668367 (https://phabricator.wikimedia.org/T275633) [11:24:40] 10SRE, 10Patch-For-Review: wmf-utils has an outdated script to update known hosts files - https://phabricator.wikimedia.org/T275806 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Let's take this as marketing opportunity for https://wikitech.wikimedia.org/wiki/Wmf-sre-laptop If something alike happens to wmf... [11:24:58] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2145 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/668367 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [11:27:13] <_joe_> !log restarted redis on mc2027 to pick up the replication change [11:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:53] RECOVERY - Check health of redis instance on 6379 on mc2027 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 44210 keys, up 1 minutes 18 seconds - replication_delay is 0 https://wikitech.wikimedia.org/wiki/Redis [11:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2145 into dbctl depooled - T275633', diff saved to https://phabricator.wikimedia.org/P14624 and previous config saved to /var/cache/conftool/dbconfig/20210304-112848-marostegui.json [11:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:55] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [11:29:37] (03CR) 10Noa wmde: [C: 03+1] Update termbox to 2021-03-01-112916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/668364 (https://phabricator.wikimedia.org/T268640) (owner: 10Jakob) 
[11:29:49] (03CR) 10Jakob: [C: 03+2] Update termbox to 2021-03-01-112916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/668364 (https://phabricator.wikimedia.org/T268640) (owner: 10Jakob) [11:30:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:31:07] (03Merged) 10jenkins-bot: Update termbox to 2021-03-01-112916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/668364 (https://phabricator.wikimedia.org/T268640) (owner: 10Jakob) [11:31:10] (03CR) 10Jbond: [C: 03+2] "Looks good to me, will merge thanks for the contribution" [puppet] - 10https://gerrit.wikimedia.org/r/668184 (https://phabricator.wikimedia.org/T201491) (owner: 10Sahilgrewalhere) [11:31:24] (03PS1) 10Muehlenhoff: profile::docker::registry: Update monitoring check to check Buster [puppet] - 10https://gerrit.wikimedia.org/r/668371 [11:31:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:31:54] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) p:05Triage→03Medium [11:35:56] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (5186) = 92.7% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:40:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2145 to s1 (and repool db2116) - T275633', diff saved to https://phabricator.wikimedia.org/P14625 and previous config saved to /var/cache/conftool/dbconfig/20210304-114052-marostegui.json [11:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:59] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [11:41:27] kormat: maybe it is time to restart mysql on db1115 per the above alarm [11:42:31] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [11:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:48] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is CRITICAL: 216.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37 [11:45:43] marostegui: if you can ack it, I'll take care of it in a bit [11:46:02] kormat: sure [11:46:53] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) [11:47:11] kormat: done - thank you [11:50:03] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668371 (owner: 10Muehlenhoff) [11:50:07] 10SRE, 10Data-Persistence-Backup, 10Goal: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) [11:50:15] 10SRE, 10Packaging: Package udplog for Buster - https://phabricator.wikimedia.org/T276421 (10Peachey88) [11:50:17] (03PS1) 10Jcrespo: mediabackup: Initial setup for the swift media backup orchestator 
hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [11:50:18] PROBLEM - Check systemd state on kubernetes1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:33] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Initial setup for the swift media backup orchestator hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:53:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:55:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [11:56:57] (03PS2) 10WMDE-Fisch: Remove conflicting gadget configuration for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668108 (https://phabricator.wikimedia.org/T276330) [11:56:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:59:19] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) p:05Triage→03High [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1200). [12:00:04] CFisch_WMDE and Majavah: A patch you scheduled for European mid-day backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:18] here, my patch is beta only [12:00:54] !log Stop mysql on db1117:3321 to clone db1159 [12:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:33] Majavah: I merge your patch and rebase it. It'll be there automatically in ten minutes-ish [12:01:42] thanks Amir1 [12:01:55] o/ [12:02:00] I assume CFisch_WMDE can self-serve? [12:02:31] I currently tried to start up my scripts but somehow my SSH connection does not fire up... [12:02:40] (03CR) 10Ladsgroup: [C: 03+2] betacluster: Switch udp2log to deployment-mwlog01 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668338 (https://phabricator.wikimedia.org/T276419) (owner: 10Majavah) [12:02:50] (03CR) 10Jbond: admin: Add aex shell account and to gitlab-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [12:02:55] It just hangs somewhere [12:03:09] maybe run it with -vv? [12:03:21] (03PS3) 10JMeybohm: Add mwilliams to analytics-privatedata-users with no ssh access [puppet] - 10https://gerrit.wikimedia.org/r/668355 (https://phabricator.wikimedia.org/T275671) [12:03:23] (03PS2) 10JMeybohm: admin: Add aex shell account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) [12:03:24] I'll try to solve it... 
if I've got no progress in the next couple of minutes I might ping you [12:03:25] (03PS1) 10JMeybohm: admin: Add daimona shell access and restriced group [puppet] - 10https://gerrit.wikimedia.org/r/668382 (https://phabricator.wikimedia.org/T276351) [12:03:40] 10SRE, 10Data-Persistence-Backup, 10Goal: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (10jcrespo) [12:03:43] Amir1: not sure if you saw my ping in -releng earlier, do you happen to know if lists.beta.wmflabs.org DNS record is needed? it points to an unassigned floating IP and there is a separate project for the mailman3 upgrade [12:03:51] 10SRE, 10Data-Persistence-Backup, 10Goal: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (10jcrespo) p:05Triage→03High [12:04:39] (03CR) 10JMeybohm: [C: 04-1] "Needs 3 business day wait" [puppet] - 10https://gerrit.wikimedia.org/r/668382 (https://phabricator.wikimedia.org/T276351) (owner: 10JMeybohm) [12:04:41] Majavah: no it's not needed. [12:04:46] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:04:54] ack, thanks, I'll delete it [12:05:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) [12:05:30] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:05:49] Thank you! 
[12:05:59] I need to find my yubikey [12:06:00] ugh [12:06:11] (03PS1) 10Marostegui: mariadb: Productionize db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668383 (https://phabricator.wikimedia.org/T258361) [12:06:21] haproxy alerts are expected [12:06:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) This will need an update of the access level in the NDA sheet when it's done. [12:06:35] (03Merged) 10jenkins-bot: betacluster: Switch udp2log to deployment-mwlog01 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668338 (https://phabricator.wikimedia.org/T276419) (owner: 10Majavah) [12:06:52] found it [12:07:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668383 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [12:07:17] (03PS1) 10Jakob: Revert "Update termbox to 2021-03-01-112916-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668162 [12:07:19] (03PS2) 10Jcrespo: mediabackup: Initial setup for the swift media backup orchestator hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [12:07:37] (03CR) 10Tarrow: [C: 03+2] Revert "Update termbox to 2021-03-01-112916-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668162 (owner: 10Jakob) [12:07:55] Majavah: rebased on mwdeploy1002 [12:08:13] thank you, now just waiting for it to be updated on beta [12:08:31] (03Merged) 10jenkins-bot: Revert "Update termbox to 2021-03-01-112916-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668162 (owner: 10Jakob) [12:08:32] I already tested it by live hacking and it should work, but still will confirm once it's there [12:08:43] nice to remove some Jessie hosts [12:10:10] (03CR) 10Jbond: [C: 04-1] admin: Add aex shell account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668360 
(https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [12:10:53] logs are flowing on the new host, thank you for clicking +2 for me [12:10:55] (03PS3) 10JMeybohm: admin: Add aex shell account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) [12:10:59] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [12:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:45] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on db1115.eqiad.wmnet,dbmonitor1001.wikimedia.org with reason: Restart db1115 to fix memory leak [12:11:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db1115.eqiad.wmnet,dbmonitor1001.wikimedia.org with reason: Restart db1115 to fix memory leak [12:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:07] (03PS1) 10Majavah: scap: Swap betacluster udplog deployment-mwlog01 [puppet] - 10https://gerrit.wikimedia.org/r/668384 (https://phabricator.wikimedia.org/T276419) [12:14:30] RECOVERY - Check systemd state on kubernetes1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:42] RECOVERY - MariaDB memory on db1115 is OK: OK Memory 71% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:17:24] PROBLEM - MariaDB Replica IO: db_inventory on db2093 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1115.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1115.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:19:24] RECOVERY - MariaDB 
Replica IO: db_inventory on db2093 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:22:37] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:25:17] (03PS1) 10Majavah: udp2log: Swap beta cluster to deployment-mwlog01 [puppet] - 10https://gerrit.wikimedia.org/r/668386 (https://phabricator.wikimedia.org/T276419) [12:26:40] (03PS1) 10Jbond: icinga: rename wait_for_icinga_optimal to wait_for_optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/668387 [12:26:52] volans: ^^^^ [12:27:07] marostegui: any idea what's up with dbproxy1012? [12:27:25] jbond42: thx! [12:27:38] Amir1: Seems like I go it working now, will start my config deployment now. Wish me luck ;-) [12:27:42] marostegui: ohh. it's in front of db1117 [12:27:43] kormat: I am doing a transfer from db1117 to a new host, sometimes that cause the proxies to flap for db1117 [12:27:45] yeah [12:27:50] i'll ack it. [12:27:50] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/668387 (owner: 10Jbond) [12:28:10] even if it is not the affected port, I guess it gets saturated for the whole host [12:28:28] ACKNOWLEDGEMENT - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Kormat db1117 is busy https://wikitech.wikimedia.org/wiki/HAProxy [12:28:52] CFisch_WMDE: cool. 
Floor is yours [12:29:00] (03CR) 10WMDE-Fisch: [C: 03+2] "Deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668108 (https://phabricator.wikimedia.org/T276330) (owner: 10WMDE-Fisch) [12:29:25] volans: np [12:29:55] (03Merged) 10jenkins-bot: Remove conflicting gadget configuration for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668108 (https://phabricator.wikimedia.org/T276330) (owner: 10WMDE-Fisch) [12:30:17] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:31:36] (03CR) 10Tarrow: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/668388 (owner: 10Tarrow) [12:32:44] (03CR) 10Jakob: [C: 03+1] Fix hostname for accessing the repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/668388 (owner: 10Tarrow) [12:32:55] (03CR) 10Jakob: [C: 03+2] Fix hostname for accessing the repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/668388 (owner: 10Tarrow) [12:33:32] (03Merged) 10jenkins-bot: Fix hostname for accessing the repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/668388 (owner: 10Tarrow) [12:33:59] (03CR) 10Jbond: [C: 03+2] "PCC changes seem expected" [puppet] - 10https://gerrit.wikimedia.org/r/668386 (https://phabricator.wikimedia.org/T276419) (owner: 10Majavah) [12:34:23] (03CR) 10Jbond: [C: 03+2] "PCC changes seem expected" [puppet] - 10https://gerrit.wikimedia.org/r/668384 (https://phabricator.wikimedia.org/T276419) (owner: 10Majavah) [12:34:33] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[12:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:19] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:38:47] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:40:29] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@6fcbb9f]: (no justification provided) [12:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:42] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:668108|Remove conflicting gadget configuration for hewiki (T276330)]] (duration: 01m 12s) [12:40:43] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@6fcbb9f]: (no justification provided) (duration: 00m 14s) [12:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:48] T276330: Remove RefPreviews conflicting gadget configuration: CiteTooltip on hewiki - https://phabricator.wikimedia.org/T276330 [12:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:03] Done :-) [12:42:58] (03PS1) 10Tarrow: Revert "Fix hostname for accessing the repo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668163 [12:43:57] (03CR) 10Jakob: [C: 03+2] Revert "Fix hostname for accessing the repo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668163 (owner: 10Tarrow) [12:44:45] (03Merged) 10jenkins-bot: Revert "Fix hostname for accessing the repo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668163 (owner: 10Tarrow) [12:45:37] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[12:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:20] !log drain + reimage analytics10[61,62] to Debian Buster [12:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:52] (03CR) 10Muehlenhoff: [C: 03+2] profile::docker::registry: Update monitoring check to check Buster [puppet] - 10https://gerrit.wikimedia.org/r/668371 (owner: 10Muehlenhoff) [12:49:11] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37 [12:49:50] (03PS1) 10Marostegui: mariadb: Productionize db2146 [puppet] - 10https://gerrit.wikimedia.org/r/668390 (https://phabricator.wikimedia.org/T275633) [12:50:03] (03CR) 10Volans: [C: 03+2] icinga: rename wait_for_icinga_optimal to wait_for_optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/668387 (owner: 10Jbond) [12:51:51] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2146 [puppet] - 10https://gerrit.wikimedia.org/r/668390 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:53:51] (03PS1) 10MSantos: fix state_path for imposm engine [puppet] - 10https://gerrit.wikimedia.org/r/668391 [12:55:05] (03PS1) 10Majavah: arclamp: Switch beta cluster redis to deployment-webperf01 [puppet] - 10https://gerrit.wikimedia.org/r/668392 [12:55:17] (03PS2) 10MSantos: fix state_path for imposm engine [puppet] - 10https://gerrit.wikimedia.org/r/668391 [12:55:32] (03PS3) 10MSantos: fix state_path for imposm engine [puppet] - 10https://gerrit.wikimedia.org/r/668391 [12:55:59] (03PS2) 10Majavah: arclamp: Switch beta cluster redis to deployment-webperf01 [puppet] - 10https://gerrit.wikimedia.org/r/668392 
(https://phabricator.wikimedia.org/T276419) [12:56:46] (03Merged) 10jenkins-bot: icinga: rename wait_for_icinga_optimal to wait_for_optimal [software/spicerack] - 10https://gerrit.wikimedia.org/r/668387 (owner: 10Jbond) [12:59:11] (03CR) 10Muehlenhoff: [C: 03+2] arclamp: Switch beta cluster redis to deployment-webperf01 [puppet] - 10https://gerrit.wikimedia.org/r/668392 (https://phabricator.wikimedia.org/T276419) (owner: 10Majavah) [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1300) [13:03:37] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.49 [software/spicerack] - 10https://gerrit.wikimedia.org/r/668396 [13:04:28] (03PS1) 10Jbond: admin: restricted add Wolfgang as the group approver [puppet] - 10https://gerrit.wikimedia.org/r/668397 [13:06:35] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1061.eqiad.wmnet with reason: REIMAGE [13:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:23] (03PS1) 10Marostegui: instances.yaml: Add db2146 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/668400 (https://phabricator.wikimedia.org/T275633) [13:07:47] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1062.eqiad.wmnet with reason: REIMAGE [13:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:01] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2146 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/668400 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [13:08:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1061.eqiad.wmnet with reason: REIMAGE [13:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:38] (03CR) 10Jbond: "FYI the three day waiting period requirement has been dropped. 
however we do need the group owner to approve the request. currently we " [puppet] - 10https://gerrit.wikimedia.org/r/668382 (https://phabricator.wikimedia.org/T276351) (owner: 10JMeybohm) [13:10:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1062.eqiad.wmnet with reason: REIMAGE [13:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2146 to dbctl T275633', diff saved to https://phabricator.wikimedia.org/P14631 and previous config saved to /var/cache/conftool/dbconfig/20210304-131301-marostegui.json [13:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:08] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:18:06] (03CR) 10Wolfgang Kandek: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668382 (https://phabricator.wikimedia.org/T276351) (owner: 10JMeybohm) [13:18:10] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.49 [software/spicerack] - 10https://gerrit.wikimedia.org/r/668396 (owner: 10Volans) [13:18:44] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28370/console" [puppet] - 10https://gerrit.wikimedia.org/r/668391 (owner: 10MSantos) [13:21:41] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] fix state_path for imposm engine [puppet] - 10https://gerrit.wikimedia.org/r/668391 (owner: 10MSantos) [13:23:06] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.49 [software/spicerack] - 10https://gerrit.wikimedia.org/r/668396 (owner: 10Volans) [13:25:26] (03PS1) 10Volans: Upstream release v0.0.49 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/668417 [13:29:57] !log installing libzstd security updates on Buster [13:30:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:38] !log drain + reimage analytics10[63,64] to Debian Buster [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:49] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.49 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/668417 (owner: 10Volans) [13:34:14] jouncebot: now [13:34:14] For the next 0 hour(s) and 25 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1300) [13:35:10] !log restarting mw canaries for libzstd update [13:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:29] (03Merged) 10jenkins-bot: Upstream release v0.0.49 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/668417 (owner: 10Volans) [13:39:44] (03PS1) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668423 (https://phabricator.wikimedia.org/T275550) [13:44:12] !log uploaded spicerack_0.0.49 to apt.wikimedia.org buster-wikimedia [13:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [13:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:18] (03PS1) 10Ottomata: Don't install a copy of R in a stacked user conda env [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668425 (https://phabricator.wikimedia.org/T224658) [13:45:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2116', diff saved to https://phabricator.wikimedia.org/P14632 and previous config saved to /var/cache/conftool/dbconfig/20210304-134521-marostegui.json [13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:59] RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:47:02] 
10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) @wkandek can you act as approver for access to the restricted group (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/adm... [13:47:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) [13:48:14] (03PS3) 10Kormat: mariadb: Add section parameters: core::multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/668031 [13:48:35] RECOVERY - configured eth on sretest1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:48:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [13:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1063.eqiad.wmnet with reason: REIMAGE [13:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:09] (03PS4) 10JMeybohm: admin: Add olykalinichenko shell account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) [13:52:47] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1064.eqiad.wmnet with reason: REIMAGE [13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1063.eqiad.wmnet with reason: REIMAGE [13:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - 
https://phabricator.wikimedia.org/T275677 (10JMeybohm) @OlyKalinichenkoSpeedAndFunction we need the shell username to be the same as your wikitech shell name so this is... [13:54:16] (03CR) 10Kormat: [C: 03+1] "Changes made in prod. I'll merge this now." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [13:54:27] (03CR) 10Kormat: [C: 03+2] mariadb: prod-m5 grants: add mwmaint2002, rm mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [13:55:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1064.eqiad.wmnet with reason: REIMAGE [13:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10Urbanecm) My own deployment access was approved by @greg some time back, I'm not sure whether the group owner changed in the meantime. [14:00:04] liw and longma: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1400). [14:00:30] (03PS1) 10Lars Wirzenius: all wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668434 [14:00:32] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668434 (owner: 10Lars Wirzenius) [14:00:57] liw: fingers crossed :) [14:02:29] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.33 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668434 (owner: 10Lars Wirzenius) [14:02:32] I'm sure this will go well: Fugees is playing "Killing me softly" in my headphones. 
[14:04:10] !log liw@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.33 [14:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:02] RECOVERY - Ensure local MW versions match expected deployment on deploy1001 is OK: OKAY: Not alerting due to fresh production wikiversions: 839 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:07:21] (03PS1) 10Ottomata: Exclude the readonly conda base env from the list of Jupyter profiles [puppet] - 10https://gerrit.wikimedia.org/r/668439 (https://phabricator.wikimedia.org/T224658) [14:08:08] (03PS5) 10JMeybohm: Remove old kubernetes staging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) [14:09:09] (03PS2) 10Ottomata: Exclude the readonly conda base env from the list of Jupyter profiles [puppet] - 10https://gerrit.wikimedia.org/r/668439 (https://phabricator.wikimedia.org/T224658) [14:12:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:14:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:15:31] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission for hosts neon.eqiad.wmnet [14:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:54] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts neon.eqiad.wmnet [14:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:27] (03CR) 10JMeybohm: [C: 03+2] Remove old kubernetes staging master neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/668011 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [14:18:52] !log jayme@cumin1001 START - Cookbook sre.hosts.decommission for hosts neon.eqiad.wmnet [14:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:35] (03PS4) 10Kormat: mariadb: Add section parameters: core::multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/668031 (https://phabricator.wikimedia.org/T275497) [14:19:37] (03PS1) 10Kormat: mariadb: Add section parameters: misc [puppet] - 10https://gerrit.wikimedia.org/r/668444 (https://phabricator.wikimedia.org/T275497) [14:21:52] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28371/console" [puppet] - 10https://gerrit.wikimedia.org/r/668444 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:23:17] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts neon.eqiad.wmnet [14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:57] (03PS1) 10Ottomata: Rename CONDA_BASE_ENV_PATH to CONDA_BASE_ENV_PREFIX [puppet] - 10https://gerrit.wikimedia.org/r/668445 (https://phabricator.wikimedia.org/T224658) [14:24:40] (03PS1) 10Ottomata: Rename CONDA_BASE_ENV_PATH to CONDA_BASE_ENV_PREFIX 
[debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668446 (https://phabricator.wikimedia.org/T224658) [14:25:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [14:26:21] (03CR) 10JMeybohm: [C: 03+2] admin: Add olykalinichenko shell account and to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/668360 (https://phabricator.wikimedia.org/T275677) (owner: 10JMeybohm) [14:30:18] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [14:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:28] (03PS2) 10Kormat: mariadb: Add section parameters: misc [puppet] - 10https://gerrit.wikimedia.org/r/668444 (https://phabricator.wikimedia.org/T275497) [14:30:43] (03PS1) 10Jcrespo: dbbackups: Update backup metadata host db1080->db1159 [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) [14:31:06] (03PS2) 10Ottomata: Rename CONDA_BASE_ENV_PATH to CONDA_BASE_ENV_PREFIX [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668446 (https://phabricator.wikimedia.org/T224658) [14:31:37] (03CR) 10Jcrespo: [C: 04-1] "Blocked on actual master switch." 
[puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo) [14:31:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10JMeybohm) 05Open→03Resolved a:03JMeybohm [14:32:44] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28372/console" [puppet] - 10https://gerrit.wikimedia.org/r/668444 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:33:48] (03PS2) 10JMeybohm: admin: Add daimona shell access and restriced group [puppet] - 10https://gerrit.wikimedia.org/r/668382 (https://phabricator.wikimedia.org/T276351) [14:34:32] (03PS3) 10Ottomata: Rename CONDA_BASE_ENV_PATH to CONDA_BASE_ENV_PREFIX [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668446 (https://phabricator.wikimedia.org/T224658) [14:34:41] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [14:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:01] (03CR) 10Marostegui: "Thanks - I will merge once we switch it (but it won't happen in the next few weeks)" [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo) [14:38:05] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:13] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1065.eqiad.wmnet with reason: REIMAGE [14:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:58] (03CR) 10Ottomata: [C: 03+2] Rename CONDA_BASE_ENV_PATH 
to CONDA_BASE_ENV_PREFIX [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668446 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:39:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Rename CONDA_BASE_ENV_PATH to CONDA_BASE_ENV_PREFIX [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668446 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:39:37] (03CR) 10Ottomata: [C: 03+2] Exclude the readonly conda base env from the list of Jupyter profiles [puppet] - 10https://gerrit.wikimedia.org/r/668439 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:39:51] (03PS2) 10Ottomata: Rename CONDA_BASE_ENV_PATH to CONDA_BASE_ENV_PREFIX [puppet] - 10https://gerrit.wikimedia.org/r/668445 (https://phabricator.wikimedia.org/T224658) [14:40:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1065.eqiad.wmnet with reason: REIMAGE [14:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Great! 
No thanks because you forced me to go re-read how complete works, which should be considered a professional hazard, more or less 😊" [puppet] - 10https://gerrit.wikimedia.org/r/668181 (owner: 10JMeybohm) [14:40:40] (03CR) 10Marostegui: [C: 03+1] mariadb: Add section parameters: misc [puppet] - 10https://gerrit.wikimedia.org/r/668444 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:42:08] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28373/console" [puppet] - 10https://gerrit.wikimedia.org/r/668031 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:42:39] (03CR) 10Marostegui: [C: 03+1] mariadb: Add section parameters: core::multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/668031 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:42:43] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/668181 (owner: 10JMeybohm) [14:43:08] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Add section parameters: core::multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/668031 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:43:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I am gonna venture a +1 here, not sure I want to know much more about bash autocomplete" [puppet] - 10https://gerrit.wikimedia.org/r/668181 (owner: 10JMeybohm) [14:43:21] ottomata: okay to merge "Exclude the readonly conda base env from the list of Jupyter profiles" ? [14:43:23] yes sorry [14:43:31] was about to do that! 
please merge [14:43:31] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Add section parameters: misc [puppet] - 10https://gerrit.wikimedia.org/r/668444 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [14:43:39] ottomata: np, merged [14:44:07] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28374/an-test-client1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/668445 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [14:49:54] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/668449 (https://phabricator.wikimedia.org/T276448) (owner: 10Jcrespo) [14:51:05] 10SRE, 10ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (10Jclark-ctr) @jijiki I stopped by cage briefly this morning and looked at server. I could not get it to boot could be a bad cpu or main board. [14:51:25] 10SRE, 10ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (10Jclark-ctr) a:03Jclark-ctr [14:56:25] (03PS1) 10Ottomata: Add requested packages to base anaconda-wmf [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/668453 (https://phabricator.wikimedia.org/T271960) [15:03:20] 10Puppet, 10SRE: puppet admin module: Assigne approveres to unix groups - https://phabricator.wikimedia.org/T276465 (10jbond) p:05Triage→03Medium [15:07:11] 10Puppet, 10SRE: puppet admin module: Assigne approveres to unix groups - https://phabricator.wikimedia.org/T276465 (10jbond) [15:08:43] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) [15:09:32] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:11:01] !log 
aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1024.eqiad.wmnet [15:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] !log drain + reimage analytics106[6,7] to Debian Buster [15:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:44] 10Puppet, 10SRE: puppet admin module: Assigne approveres to unix groups - https://phabricator.wikimedia.org/T276465 (10Gilles) [15:15:46] (03PS2) 10Herron: assign mwlog2002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667911 (https://phabricator.wikimedia.org/T224565) [15:16:50] (03CR) 10Gehel: wdqs: expose wdqs1009 externally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668173 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [15:17:19] (03PS1) 10Tarrow: Revert "Revert "Update termbox to 2021-03-01-112916-production"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668164 [15:17:31] (03CR) 10Gehel: [C: 03+1] "LGTM, but I only have limited understanding of DNS (this looks simple enough)." [dns] - 10https://gerrit.wikimedia.org/r/668255 (https://phabricator.wikimedia.org/T266470) (owner: 10Ryan Kemper) [15:17:52] PROBLEM - Ensure local MW versions match expected deployment on deploy1001 is CRITICAL: CRITICAL: 839 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [15:19:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:34] (03CR) 10Herron: [C: 03+2] assign mwlog2002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667911 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [15:19:53] (03CR) 10Jakob: [C: 03+2] Revert "Revert "Update termbox to 2021-03-01-112916-production"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668164 (owner: 10Tarrow) [15:20:15] mutante: isn't deploy1001 old, should the alert be off? [15:20:29] (03Merged) 10jenkins-bot: Revert "Revert "Update termbox to 2021-03-01-112916-production"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/668164 (owner: 10Tarrow) [15:21:41] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [15:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:04] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1066.eqiad.wmnet with reason: REIMAGE [15:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1067.eqiad.wmnet with reason: REIMAGE [15:26:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1066.eqiad.wmnet with reason: REIMAGE [15:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1067.eqiad.wmnet with reason: REIMAGE [15:28:25] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[15:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:00] (03PS2) 10Herron: assign mwlog1002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667912 (https://phabricator.wikimedia.org/T224565) [15:31:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10wkandek) Approved. [15:32:17] (03PS1) 10Kormat: mariadb: Use section parameters: misc profiles. [puppet] - 10https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [15:33:46] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Use section parameters: misc profiles. [puppet] - 10https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [15:34:27] (03PS1) 10Jbond: P:memcached: drop admin_groups as has no meening here [puppet] - 10https://gerrit.wikimedia.org/r/668465 [15:35:34] (03CR) 10JMeybohm: [C: 03+2] admin: Add daimona shell access and restriced group [puppet] - 10https://gerrit.wikimedia.org/r/668382 (https://phabricator.wikimedia.org/T276351) (owner: 10JMeybohm) [15:37:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Merged, thanks! [15:37:51] (03CR) 10Herron: [C: 03+2] assign mwlog1002 role::logging::mediawiki::udp2log [puppet] - 10https://gerrit.wikimedia.org/r/667912 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [15:37:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for Daimona - https://phabricator.wikimedia.org/T276351 (10Daimona) \o/ thank you! [15:41:44] (03PS2) 10Kormat: mariadb: Use section parameters: misc profiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [15:42:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1024.eqiad.wmnet [15:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:01] (03PS1) 10Ottomata: Jupyter - never use webproxy for *.wmnet URLs and use system cacerts [puppet] - 10https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658) [15:46:57] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/668468 (owner: 10Jakob) [15:47:15] (03CR) 10Tarrow: [C: 03+2] termbox: wire test service to api-rw [deployment-charts] - 10https://gerrit.wikimedia.org/r/668468 (owner: 10Jakob) [15:48:35] (03Merged) 10jenkins-bot: termbox: wire test service to api-rw [deployment-charts] - 10https://gerrit.wikimedia.org/r/668468 (owner: 10Jakob) [15:51:34] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:07] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [15:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:35] (03PS1) 10Gergő Tisza: Add Link: use production link recommendation service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668470 (https://phabricator.wikimedia.org/T274198) [15:53:00] (03CR) 10Hashar: doc: script to restart php-fpm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666309 (https://phabricator.wikimedia.org/T275468) (owner: 10Hashar) [15:53:10] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10Papaul) a:05Papaul→03CDanis @CDanis this all complete on my side. The device is connected on scs1-a1 on port 47. 
Tested console using Cisco console cable with baud rate... [15:54:22] (03CR) 10Kosta Harlan: [C: 03+1] Add Link: use production link recommendation service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668470 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [15:54:26] (03PS3) 10Kormat: mariadb: Use section parameters: misc profiles. [puppet] - 10https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) [15:55:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1025.eqiad.wmnet [15:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:07] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28380/console" [puppet] - 10https://gerrit.wikimedia.org/r/668464 (https://phabricator.wikimedia.org/T275497) (owner: 10Kormat) [16:02:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1025.eqiad.wmnet [16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:56] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1026.eqiad.wmnet [16:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:35] (03PS1) 10Marostegui: db2116: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/668480 [16:11:40] (03CR) 10Marostegui: [C: 03+2] db2116: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/668480 (owner: 10Marostegui) [16:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2145', diff saved to https://phabricator.wikimedia.org/P14635 and previous config saved to /var/cache/conftool/dbconfig/20210304-161226-marostegui.json [16:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:02] 10SRE, 10ops-codfw, 10netops: ripe-atlas-codfw is down - 
https://phabricator.wikimedia.org/T267714 (10Papaul) a:05faidon→03Papaul @CDanis The old device is already set to decom in netbox. let me know when the new device is online so i can offline this device. [16:13:33] (03PS6) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) [16:13:40] (03PS4) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) [16:13:47] (03PS2) 10David Caro: wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) [16:13:49] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1026.eqiad.wmnet [16:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:34] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/668482 (owner: 10Jakob) [16:16:15] (03CR) 10Tarrow: [C: 03+2] termbox: revert to match current deploy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/668482 (owner: 10Jakob) [16:17:33] robh: I put a link on the task [16:17:40] Can you screenshot the options you see [16:18:13] (03Merged) 10jenkins-bot: termbox: revert to match current deploy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/668482 (owner: 10Jakob) [16:18:21] (03CR) 10Herron: [C: 03+1] "Thanks for this! There's a lot changing here but afaict it looks good. 
Here's a PCC run https://puppet-compiler.wmflabs.org/compiler1001" [puppet] - 10https://gerrit.wikimedia.org/r/668189 (https://phabricator.wikimedia.org/T273919) (owner: 10Cwhite) [16:18:43] 10SRE, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10Papaul) p:05High→03Lowest [16:19:20] 10SRE, 10serviceops, 10Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (10Papaul) [16:19:24] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Papaul) [16:19:34] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:19:48] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [16:20:17] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [16:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:36] (03CR) 10Dduvall: [C: 03+1] "Not being that familiar with our existing prod apache configs, I've given this a once over for glaring issues and I don't see any. All the" (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/667886 (owner: 10Giuseppe Lavagetto) [16:22:13] (03CR) 10Jbond: [C: 03+1] "thanks optional nit otherwise lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666309 (https://phabricator.wikimedia.org/T275468) (owner: 10Hashar) [16:23:14] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:11] (03CR) 10Dduvall: [C: 03+1] "Would it make sense to change the variable name to WMF_MAINTENANCE_OFFLINE to reflect that it is only used by mediawiki-config and has no " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [16:26:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:31:56] (03CR) 10Dzahn: "thank you" [puppet] - 10https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: 10Dzahn) [16:33:28] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1031.eqiad.wmnet [16:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:54] (03PS9) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) [16:34:07] (03CR) 10jerkins-bot: [V: 04-1] utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:35:42] 10SRE, 10ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (10jijiki) 05Open→03Resolved @Jclark-ctr thank you very much, I am closing this task since we have replacements on the way [16:36:47] (03CR) 10Hashar: doc: script to restart php-fpm (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666309 (https://phabricator.wikimedia.org/T275468) (owner: 10Hashar) [16:36:53] (03PS5) 10Hashar: doc: script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/666309 (https://phabricator.wikimedia.org/T275468) 
[16:38:41] (03PS10) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) [16:38:53] (03CR) 10jerkins-bot: [V: 04-1] utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:39:12] (03PS2) 10Ottomata: Jupyter - never use webproxy for *.wmnet URLs and use system cacerts [puppet] - 10https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658) [16:39:14] (03PS1) 10Ottomata: Spark JVMs inherit system http proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/668485 (https://phabricator.wikimedia.org/T224658) [16:39:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1031.eqiad.wmnet [16:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:11] (03CR) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:40:33] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021030... 
[16:41:33] (03PS11) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) [16:43:06] (03PS1) 10JMeybohm: Enable egress networking for termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/668486 [16:43:22] (03CR) 10Jbond: "@volans any idea why jenkins is complaining as far as i can see the CR is rebased on master?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:43:47] jbond42: This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset. [16:44:20] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10jijiki) [16:44:23] let me check [16:45:39] thanks [16:45:45] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) a:05Papaul→03jcrespo PXE worked after I: 1. Disabled PXE on the 1Gb device. 2. Enabled PXE on the 10Gb device 3. Rebooted 4. After reboot, the 10Gb one appeared on the boot... [16:46:06] (03CR) 10Thcipriani: [C: 03+1] "Can't speak to conf specifics, but test case lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/659426 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:46:12] (03CR) 10JMeybohm: [C: 03+2] Enable egress networking for termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/668486 (owner: 10JMeybohm) [16:46:16] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:46:22] (03CR) 10Tarrow: [C: 03+2] Enable egress networking for termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/668486 (owner: 10JMeybohm) [16:47:25] (03Merged) 10jenkins-bot: Enable egress networking for termbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/668486 (owner: 10JMeybohm) [16:47:49] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:00] (03PS12) 10Volans: utils/run_ci_localy.sh: run CI locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:49:14] (03CR) 10jerkins-bot: [V: 04-1] utils/run_ci_localy.sh: run CI locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:49:46] jbond42: no idea, either ping releng or try to remove the change-id from the commit message and send it as a new patch... 
sorry can't dig into it right now [16:52:12] (03PS9) 10Razzi: wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) [16:52:14] (03CR) 10David Caro: utils/run_ci_localy.sh: run CI locally (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:52:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1068.eqiad.wmnet with reason: REIMAGE [16:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:40] !log tarrow@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [16:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:57] (03PS2) 10Ottomata: Spark JVMs inherit system http settings [puppet] - 10https://gerrit.wikimedia.org/r/668485 (https://phabricator.wikimedia.org/T224658) [16:54:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1069.eqiad.wmnet with reason: REIMAGE [16:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:17] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1068.eqiad.wmnet with reason: REIMAGE [16:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1069.eqiad.wmnet with reason: REIMAGE [16:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:38] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2003.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2003.codfw.wmnet'] ` [16:56:48] !log tarrow@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [16:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:12] (03CR) 10Jbond: utils/run_ci_localy.sh: run CI locally (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [16:59:10] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1032.eqiad.wmnet [16:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1700). [17:00:59] (03PS1) 10Bstorm: wikireplicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/668490 (https://phabricator.wikimedia.org/T276124) [17:01:24] (03CR) 10David Caro: utils/run_ci_localy.sh: run CI locally (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [17:01:42] (03CR) 10Bstorm: "Something is holding the views and not quite letting me make changes." [puppet] - 10https://gerrit.wikimedia.org/r/668490 (https://phabricator.wikimedia.org/T276124) (owner: 10Bstorm) [17:01:52] etherpad down for everyone or just me? [17:01:58] oh, there it goes! 
[17:05:35] (03PS6) 10Ahmon Dancy: wmf-config/CommonSettings.php: Add WMF_MAINTENANCE_OFFLINE handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) [17:05:52] (03CR) 10Ahmon Dancy: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: 10Ahmon Dancy) [17:05:57] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1032.eqiad.wmnet [17:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:04] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1009 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/668490 (https://phabricator.wikimedia.org/T276124) (owner: 10Bstorm) [17:08:55] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:09:38] (03PS1) 10Razzi: wikireplicas: give analytics_multiinstance role to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/668494 (https://phabricator.wikimedia.org/T269211) [17:11:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on deploy1001.eqiad.wmnet with reason: decom [17:11:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on deploy1001.eqiad.wmnet with reason: decom [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:48] (03PS2) 10Phamhi: wikireplica: depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/667915 (https://phabricator.wikimedia.org/T273281) [17:12:32] RhinosF1: yes, the alert should be off, fixed [17:12:52] Ty mutante [17:13:04] (03CR) 10Hashar: "But it did work for me locally! 
:-\ Then if it is broken for sure lets just dismiss my change, afterall it is just for some UI glitch ;)" [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar) [17:14:28] (03PS2) 10Gergő Tisza: [beta] Add Link: use production link recommendation service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668470 (https://phabricator.wikimedia.org/T274198) [17:15:09] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) [17:15:30] (03Abandoned) 10Hashar: logspam-watch: redraw when terminal size changes [puppet] - 10https://gerrit.wikimedia.org/r/668172 (owner: 10Hashar) [17:15:59] 10SRE, 10serviceops, 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) I want to >>! In T271967#6877704, @Joe wrote: > Can I ask how do we intend to perform the transition from non-tls to tls in detail? I see a series of pitfalls with our... [17:16:57] PROBLEM - puppet last run on maps1009 is CRITICAL: CRITICAL: Puppet has been disabled for longer than 86400 seconds, message: hnowlan: resyncing imposm from scratch - hnowlan, last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:18:37] (03PS1) 10Bstorm: Revert "wikireplicas: depool labsdb1009 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/668165 [17:20:00] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: depool labsdb1009 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/668165 (owner: 10Bstorm) [17:20:17] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:28] (03CR) 10Dzahn: [C: 03+2] package_builder: convert cowbuilder cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:21:28] RECOVERY - Uncommitted DNS 
changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:21:36] (03CR) 10Phamhi: [C: 03+2] wikireplica: depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/667915 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [17:21:45] (03PS3) 10Phamhi: wikireplica: depool clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/667915 (https://phabricator.wikimedia.org/T273281) [17:23:03] phamhi: you can merge both [17:23:10] (03PS1) 10Bstorm: wikireplicas: depool labsdb1010 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/668497 (https://phabricator.wikimedia.org/T276124) [17:23:59] mutante: thanks [17:24:14] 10SRE, 10Analytics-Clusters: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10razzi) [17:24:31] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) Sadly, while the installer now works correctly, after install, local disk drive boot fails, and goes back to network boot: ` Booting from Hard drive C: Booting from BRCM MBA S... [17:25:04] phamhi: np, it will make you type "multiple" [17:25:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:32] mutante xD haha, thanks [17:25:36] (03PS2) 10Bstorm: wikireplicas: depool labsdb1010 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/668497 (https://phabricator.wikimedia.org/T276124) [17:25:44] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet last ran 22 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:26:04] (03CR) 10Dzahn: "thank you very much Jbond.
waiting for cloud sign-off" [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:27:22] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10razzi) [17:28:11] (03PS3) 10Bstorm: wikireplicas: depool labsdb1010 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/668497 (https://phabricator.wikimedia.org/T276124) [17:31:00] (03CR) 10Dzahn: "[deneb:~] $ sudo systemctl list-timers | grep cow" [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:31:02] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:31:40] (03CR) 10Bstorm: [C: 03+2] "Synced up with @Phamhi and merging" [puppet] - 10https://gerrit.wikimedia.org/r/668497 (https://phabricator.wikimedia.org/T276124) (owner: 10Bstorm) [17:32:12] PROBLEM - MegaRAID on analytics1066 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:32:54] (03CR) 10Dzahn: "oops, need to remove the "> /dev/null" part from the command further up."
[puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:33:00] (03PS1) 10Jcrespo: Revert "install_Server: Apply custom/backup-format.cfg to backup[12]003" [puppet] - 10https://gerrit.wikimedia.org/r/668506 [17:34:00] (03PS2) 10Jcrespo: Revert "install_Server: Apply custom/backup-format.cfg to backup[12]003" [puppet] - 10https://gerrit.wikimedia.org/r/668506 [17:35:30] (03CR) 10Jcrespo: [C: 03+2] Revert "install_Server: Apply custom/backup-format.cfg to backup[12]003" [puppet] - 10https://gerrit.wikimedia.org/r/668506 (owner: 10Jcrespo) [17:36:49] (03PS1) 10Dzahn: package_builder: remove /dev/null redirection from cowbuilder update command [puppet] - 10https://gerrit.wikimedia.org/r/668500 (https://phabricator.wikimedia.org/T273673) [17:37:00] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:18] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/668500" [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:37:32] (03CR) 10Dzahn: [C: 03+2] package_builder: remove /dev/null redirection from cowbuilder update command [puppet] - 10https://gerrit.wikimedia.org/r/668500 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:39:20] !log [deneb:~] $ sudo systemctl start cowbuilder_update_jessie-amd64 [17:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:46] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021030... 
[17:40:10] (03CR) 10Dzahn: "[deneb:~] $ sudo systemctl start cowbuilder_update_jessie-amd64" [puppet] - 10https://gerrit.wikimedia.org/r/668500 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:40:15] (03CR) 10Dzahn: "[deneb:~] $ sudo systemctl start cowbuilder_update_jessie-amd64" [puppet] - 10https://gerrit.wikimedia.org/r/667008 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:40:22] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:13] ^ that is because package_builder on deneb now uses timers instead of cron jobs [17:42:17] working [17:47:37] (03PS1) 10Volans: sre.hosts.decommission: temporary fix for Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689) [17:49:57] (03CR) 10Dzahn: phabricator::tools: replace cron jobs with timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:50:13] (03PS1) 10Jbond: Revert "Revert "ldap::config::labs: replace hiera_hash with lookup"" [puppet] - 10https://gerrit.wikimedia.org/r/668507 [17:51:29] (03PS1) 10Dzahn: package_builder: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/668527 (https://phabricator.wikimedia.org/T273673) [17:51:57] (03PS2) 10Dzahn: package_builder: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/668527 (https://phabricator.wikimedia.org/T273673) [17:52:52] (03CR) 10Dzahn: [C: 03+2] package_builder: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/668527 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:53:31] (03PS2) 10Jbond: ldap::config::labs: replace hiera_hash with lookup [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) [17:53:35] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics 
worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10razzi) [17:54:06] (03PS3) 10Jbond: ldap::config::labs: replace hiera_hash with lookup [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) [17:55:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28387/console" [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [17:56:07] (03CR) 10Dzahn: phabricator::tools: replace cron jobs with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [17:56:09] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2003.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2003.codfw.wmnet'] ` [17:56:13] (03CR) 10Jbond: "> Patch Set 7:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:57:12] (03CR) 10Dzahn: ldap::config::labs: replace hiera_hash with lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:58:15] (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: temporary fix for Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/668505 (https://phabricator.wikimedia.org/T274689) (owner: 10Volans) [17:58:25] (03CR) 10Dzahn: phabricator::tools: replace cron jobs with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1800). [18:01:44] (03PS1) 10Bstorm: Revert "wikireplicas: depool labsdb1010 for view changes" [puppet] - 10https://gerrit.wikimedia.org/r/668508 [18:03:23] (03CR) 10Dzahn: "looks good, though I'm pretty sure last time it affected only one host, cloudweb2001-dev, and that isn't in the list at https://puppet-com" [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [18:04:07] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: depool labsdb1010 for view changes" [puppet] - 10https://gerrit.wikimedia.org/r/668508 (owner: 10Bstorm) [18:04:40] (03PS2) 10Herron: elk: send icinga events to a separate partition/index [puppet] - 10https://gerrit.wikimedia.org/r/667917 [18:05:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28388/console" [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [18:06:03] (03CR) 10Jbond: [V: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [18:09:18] (03CR) 10Dzahn: [C: 03+2] "thanks!:)" [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [18:09:37] !jouncebot now [18:09:37] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [18:09:47] jouncebot: now [18:09:47] For the next 0 hour(s) and 50 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1800) [18:09:48] jouncebot: next [18:09:48] In 0 hour(s) and 50 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1900) [18:09:56] Thanks rzl. 
:-) [18:10:07] o7 [18:10:14] (03PS1) 10Jforrester: Fix use of $this->getConfig() for configuration [extensions/FlaggedRevs] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668509 (https://phabricator.wikimedia.org/T276386) [18:10:56] (03CR) 10Elukey: [C: 03+1] P:memcached: drop admin_groups as has no meening here [puppet] - 10https://gerrit.wikimedia.org/r/668465 (owner: 10Jbond) [18:11:34] (03CR) 10Dzahn: "confirmed noop on cloudweb-dev2001 (LVS change on every puppet run there but unrelated), cloudstore1009, mwmaint1002.." [puppet] - 10https://gerrit.wikimedia.org/r/668507 (https://phabricator.wikimedia.org/T209953) (owner: 10Jbond) [18:19:51] (03CR) 10Herron: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667917 (owner: 10Herron) [18:21:06] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) I found the issue, it was the same as with the NIC, but with Drives: only one disk can be set as "bootable", so it was trying to boot from the HW raid, not the SW raid. I changed... [18:22:49] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2003.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021030... 
[18:25:01] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE [18:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:21] (03PS1) 10Jcrespo: Revert "Revert "install_Server: Apply custom/backup-format.cfg to backup[12]003"" [puppet] - 10https://gerrit.wikimedia.org/r/668510 [18:26:58] !log jynus@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on backup2003.codfw.wmnet with reason: REIMAGE [18:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:58] icinga downtime has been failing for me a few times in a row [18:28:44] maybe some of the timing changed recently [18:31:08] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) [18:33:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:05] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2003.codfw.wmnet'] ` and were **ALL** successful. [18:35:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:36:24] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) [18:36:37] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10jcrespo) 05Open→03Resolved ` ssh backup2003.codfw.wmnet Linux backup2003 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 Debian GNU/Linux 10 (buster) backup2003 is a Host... [18:37:25] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "install_Server: Apply custom/backup-format.cfg to backup[12]003"" [puppet] - 10https://gerrit.wikimedia.org/r/668510 (owner: 10Jcrespo) [18:42:57] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (10Ottomata) Approved! [18:44:22] (03PS1) 10Bstorm: wikireplicas: depool labsdb1011 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/668535 (https://phabricator.wikimedia.org/T276124) [18:48:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10Cmjohnson) [18:48:47] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10Cmjohnson) These just need the on-site setup. 
Planning on doing this tomorrow [18:50:38] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:42] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1011 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/668535 (https://phabricator.wikimedia.org/T276124) (owner: 10Bstorm) [18:51:46] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:03] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10elukey) @Cmjohnson these needs to be in the Analytics VLAN (double checking since I see "internal VLAN" in the task description) [18:59:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T1900). [19:00:04] James_F: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:01:51] I can deploy. 
[19:02:20] (03CR) 10Jforrester: [C: 03+2] Fix use of $this->getConfig() for configuration [extensions/FlaggedRevs] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668509 (https://phabricator.wikimedia.org/T276386) (owner: 10Jforrester) [19:02:32] 10SRE, 10Maps (Tilerator): Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885 (10MSantos) 05Open→03Resolved a:03Jgiannelos This has to be considered and you can find out more about the investigation at T272843#6822224. A decision to proceed with Swift as ti... [19:02:39] 10SRE, 10Maps (Tilerator): Externalize tile storage for maps - https://phabricator.wikimedia.org/T196474 (10MSantos) [19:03:38] 10SRE, 10Maps (Tilerator): Externalize tile storage for maps - https://phabricator.wikimedia.org/T196474 (10MSantos) [19:06:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:57] (03PS10) 10Razzi: wikireplicas: Add basic configuration for clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/661529 (https://phabricator.wikimedia.org/T269211) [19:07:53] (03Merged) 10jenkins-bot: Fix use of $this->getConfig() for configuration [extensions/FlaggedRevs] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668509 (https://phabricator.wikimedia.org/T276386) (owner: 10Jforrester) [19:11:46] !log jforrester@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/FlaggedRevs/frontend/specialpages/reports/ProblemChanges.php: T276386 Fix fatal calls to getConfig (duration: 01m 12s) [19:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:52] T276386: FlaggedRevs triggering: GlobalVarConfig::get: undefined option: '$wgFeed' - https://phabricator.wikimedia.org/T276386 [19:12:10] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:12:10] OK, backport done. [19:13:43] James_F: clear for me to do deployments? [19:14:17] Urbanecm: Go for it. [19:14:20] thanks [19:14:33] (03PS2) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668423 (https://phabricator.wikimedia.org/T275550) [19:14:37] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668423 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [19:15:24] (03Merged) 10jenkins-bot: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668423 (https://phabricator.wikimedia.org/T275550) (owner: 10Urbanecm) [19:16:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10Cmjohnson) [19:16:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10Cmjohnson) These need idrac setup, planning on doing this 3/5 [19:18:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 377bc4fcfd8719281776661eae2297ac1242dae6: Enable Growth features on sqwiki in stealth mode (T275550; 1/3) (duration: 00m 57s) [19:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:02] T275550: Deploy Growth features on Albanian Wikipedia - https://phabricator.wikimedia.org/T275550 [19:20:17] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 377bc4fcfd8719281776661eae2297ac1242dae6: Enable Growth features on sqwiki in stealth mode (T275550; 2/3) (duration: 00m 57s) [19:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:31] !log urbanecm@deploy1002 Synchronized wmf-config/config/sqwiki.yaml: 
377bc4fcfd8719281776661eae2297ac1242dae6: Enable Growth features on sqwiki in stealth mode (T275550; 3/3) (duration: 00m 57s) [19:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:04] (03PS1) 10Urbanecm: Enable Growth features on hiwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668537 (https://phabricator.wikimedia.org/T276450) [19:28:45] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on hiwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668537 (https://phabricator.wikimedia.org/T276450) (owner: 10Urbanecm) [19:29:41] (03Merged) 10jenkins-bot: Enable Growth features on hiwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668537 (https://phabricator.wikimedia.org/T276450) (owner: 10Urbanecm) [19:33:13] !log restarted apache and php7.0-fpm on doc1001 due to staleness [19:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:03] Urbanecm: can you throw https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668470/ on top? 
it's beta-only [19:38:33] (03PS3) 10Urbanecm: [beta] Add Link: use production link recommendation service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668470 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [19:38:34] certainly [19:38:41] (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668470 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [19:39:28] (03Merged) 10jenkins-bot: [beta] Add Link: use production link recommendation service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668470 (https://phabricator.wikimedia.org/T274198) (owner: 10Gergő Tisza) [19:40:37] tgr_: done [19:40:47] thx [19:40:54] any time [19:44:29] (03PS1) 10Ryan Kemper: wdqs: new query-preview microsite [puppet] - 10https://gerrit.wikimedia.org/r/668543 [19:44:52] (03PS2) 10Ryan Kemper: wdqs: new query-preview microsite [puppet] - 10https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) [19:46:05] (03PS1) 10Urbanecm: cleanup: Remove help panel URL from Help homepage module [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668511 (https://phabricator.wikimedia.org/T276450) [19:46:16] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) @jcrespo what you can try to do is first create a HW RAID 0 on the first SSD disk then another HW RAID 0 on the second SSD disk once that done, create a HW RAID 6 on the other 24... 
[19:46:16] (03CR) 10Urbanecm: [C: 03+2] cleanup: Remove help panel URL from Help homepage module [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/668511 (https://phabricator.wikimedia.org/T276450) (owner: 10Urbanecm) [19:48:02] (03PS3) 10Ryan Kemper: wdqs: new query-preview microsite [puppet] - 10https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) [19:50:39] anybody working on wdqs2008? Netbox is reporting that it is missing in puppetdb [19:51:24] (03PS1) 10Andrew Bogott: wmcs-drain-hypervisor: add timeout and retries [puppet] - 10https://gerrit.wikimedia.org/r/668545 (https://phabricator.wikimedia.org/T276344) [19:54:12] (03CR) 10Jforrester: "This should be fine, now that wikitech is a Cluster wiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668196 (https://phabricator.wikimedia.org/T125941) (owner: 10Hashar) [20:00:04] liw and longma: Dear deployers, time to do the Mediawiki train - European+American Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210304T2000). 
[20:01:33] still waiting for a .wmf merge [20:03:52] (Merged) jenkins-bot: cleanup: Remove help panel URL from Help homepage module [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668511 (https://phabricator.wikimedia.org/T276450) (owner: Urbanecm) [20:03:57] finally [20:04:50] (CR) Dduvall: [C: +1] wmf-config/CommonSettings.php: Add WMF_MAINTENANCE_OFFLINE handling [mediawiki-config] - https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: Ahmon Dancy) [20:05:46] (CR) Legoktm: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/667243 (owner: Ahmon Dancy) [20:06:07] (CR) Legoktm: [C: +1] wmf-config/CommonSettings.php: Add WMF_MAINTENANCE_OFFLINE handling [mediawiki-config] - https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: Ahmon Dancy) [20:08:43] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/includes/HomepageModules/Help.php: 8cc65e3fd0b4a75599171b619108584526784853: cleanup: Remove help panel URL from Help homepage module (T276450; T273118) (duration: 00m 58s) [20:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:52] T276450: Deploy Growth features on Hindi Wikipedia - https://phabricator.wikimedia.org/T276450 [20:08:52] T273118: Help panel: Remove dependency on Help Desk title existing - https://phabricator.wikimedia.org/T273118 [20:10:10] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c6b04cb1bc0b56823f96c59c93bd88f331f7d261: Enable Growth features on hiwiki in stealth mode (T276450; 1/3) (duration: 00m 57s) [20:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:25] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: c6b04cb1bc0b56823f96c59c93bd88f331f7d261: Enable Growth features on hiwiki in stealth mode (T276450; 2/3) (duration: 00m 57s)
[20:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:44] !log urbanecm@deploy1002 Synchronized wmf-config/config/hiwiki.yaml: c6b04cb1bc0b56823f96c59c93bd88f331f7d261: Enable Growth features on hiwiki in stealth mode (T276450; 3/3) (duration: 00m 58s) [20:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:04] * Urbanecm is done [20:14:04] (CR) Legoktm: Rsync private mediawiki files to releases server (2 comments) [puppet] - https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: Jeena Huneidi) [20:15:18] SRE, Language-Team, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (cscott) I'd recommend just temporarily bumping the limit for hewikisource for now. However, not that this is not equitable ei... [20:15:42] SRE, Language-Team, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (cscott) [20:17:06] (CR) Legoktm: mailman3: Add hyperkitty (2 comments) [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [20:17:12] (CR) Legoktm: [C: -1] mailman3: Add hyperkitty [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [20:17:49] (PS8) Legoktm: mailman3: Start apache2 for web [puppet] - https://gerrit.wikimedia.org/r/657950 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [20:18:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:36] RECOVERY - Prometheus jobs reduced
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:41] (CR) Legoktm: [C: +2] mailman3: Start apache2 for web [puppet] - https://gerrit.wikimedia.org/r/657950 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [20:26:51] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/668543 (https://phabricator.wikimedia.org/T266470) (owner: Ryan Kemper) [20:43:56] (CR) Cwhite: elk: send icinga events to a separate partition/index (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667917 (owner: Herron) [20:47:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:49:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:56:31] (CR) Ladsgroup: phabricator::tools: replace cron jobs with timers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [20:56:48] (PS1) Mholloway: WikimediaEvents: Bump session_tick sampling rate to 5% [mediawiki-config] - https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) [21:00:58] (PS1) Bstorm: wikireplicas: fixing a filter for actor table [puppet] - https://gerrit.wikimedia.org/r/668555 (https://phabricator.wikimedia.org/T276124) [21:01:29] (PS1) Bstorm: Revert "wikireplicas: depool labsdb1011 for view changes" [puppet] - https://gerrit.wikimedia.org/r/668512 [21:04:54] (PS3) Herron: elk: send icinga events to a separate partition/index [puppet] - https://gerrit.wikimedia.org/r/667917 [21:07:23] (CR) SBassett: [C: +1] "(per discussion on task)" [puppet] - https://gerrit.wikimedia.org/r/668555 (https://phabricator.wikimedia.org/T276124) (owner: Bstorm) [21:07:36] SRE, ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (wiki_willy) Hi @fgiunchedi - let us know when you have the decom task for ms-be1034 submitted per our conversation on IRC....then we can pull one of the drives for this. Thanks, Willy >>! In T272209#6878412, @fgiunc...
[21:07:38] (CR) jerkins-bot: [V: -1] elk: send icinga events to a separate partition/index [puppet] - https://gerrit.wikimedia.org/r/667917 (owner: Herron) [21:08:06] (CR) Bstorm: [C: +2] wikireplicas: fixing a filter for actor table [puppet] - https://gerrit.wikimedia.org/r/668555 (https://phabricator.wikimedia.org/T276124) (owner: Bstorm) [21:12:03] (CR) Bstorm: [C: +2] Revert "wikireplicas: depool labsdb1011 for view changes" [puppet] - https://gerrit.wikimedia.org/r/668512 (owner: Bstorm) [21:13:21] (PS4) Herron: elk: send icinga events to a separate partition/index [puppet] - https://gerrit.wikimedia.org/r/667917 [21:15:33] (CR) Herron: elk: send icinga events to a separate partition/index (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667917 (owner: Herron) [21:19:01] (PS4) Razzi: kafka: Disable alert for absolute max lag value [puppet] - https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) [21:20:54] (CR) Razzi: [C: +2] kafka: Disable alert for absolute max lag value [puppet] - https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) (owner: Razzi) [21:23:16] (PS1) Dduvall: Remove use of DB_NONE from SendBulkEmails [extensions/WikimediaMaintenance] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668513 [21:24:27] (CR) Ladsgroup: mailman3: Add hyperkitty (2 comments) [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [21:24:29] (PS2) Ladsgroup: mailman3: Add hyperkitty [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) [21:25:39] (PS1) Dduvall: maintenance: mergeMessageFileList should be DB_NONE [core] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668514 (https://phabricator.wikimedia.org/T260827) [21:26:07] (PS1) Dduvall: maintenance: Avoid missing l10n cache error in mergeMessageFileList
[core] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668515 [21:26:25] (PS1) Dduvall: maintenance: rebuildLocalisationCache should be DB_NONE if possible [core] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668516 (https://phabricator.wikimedia.org/T260827) [21:28:52] ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (RobH) [21:29:21] ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (RobH) [21:33:53] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [21:37:59] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [21:42:30] (PS1) Phamhi: Revert "wikireplica: depool clouddb1015" [puppet] - https://gerrit.wikimedia.org/r/668517 [21:43:22] (PS2) Phamhi: Revert "wikireplica: depool clouddb1015" [puppet] - https://gerrit.wikimedia.org/r/668517 [21:43:52] (CR) Phamhi: [C: +2] Revert "wikireplica: depool clouddb1015" [puppet] - https://gerrit.wikimedia.org/r/668517 (owner: Phamhi) [21:45:55] (CR) Ottomata: [C: +2] Add requested packages to base anaconda-wmf [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/668453 (https://phabricator.wikimedia.org/T271960) (owner: Ottomata) [21:45:57] (CR) Ottomata: [V: +2 C: +2] Add requested packages to base anaconda-wmf [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/668453 (https://phabricator.wikimedia.org/T271960) (owner: Ottomata) [21:46:26] (CR) Zabe: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) (owner: Mholloway) [21:46:53] PROBLEM - Prometheus jobs reduced availability on
alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:48:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:51:32] (Restored) Brennen Bearnes: logspam-watch: redraw when terminal size changes [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar) [21:51:40] (PS2) Brennen Bearnes: logspam-watch: redraw when terminal size changes [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar) [21:52:45] (CR) Brennen Bearnes: "Per earlier discussion with Ahmon and hashar - setting ticks instead of calling the whole display function seems like it works pretty well" [puppet] - https://gerrit.wikimedia.org/r/668172 (owner: Hashar) [21:58:02] (Abandoned) Dduvall: Remove use of DB_NONE from SendBulkEmails [extensions/WikimediaMaintenance] (wmf/1.36.0-wmf.33) - https://gerrit.wikimedia.org/r/668513 (owner: Dduvall) [22:05:11] (PS1) Phamhi: wikireplica: depool clouddb1016 [puppet] - https://gerrit.wikimedia.org/r/668563 [22:34:02] (PS5) Herron: elk: send icinga events to a separate partition/index [puppet] - https://gerrit.wikimedia.org/r/667917 [22:36:46] (CR) Cwhite: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/667917 (owner: Herron) [22:44:27] (PS1) Ottomata: Add activate.d and deactivate.d env_vars.sh [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/668566 (https://phabricator.wikimedia.org/T272313) [22:47:02] (PS3) Ottomata: Jupyter - never use webproxy for *.wmnet URLs and make Java use system cacerts [puppet] - https://gerrit.wikimedia.org/r/668466 (https://phabricator.wikimedia.org/T224658) [22:49:12] SRE, SRE-Access-Requests: wikidata.org delegated Full Google Search Console access for abaso@wikimedia.org - https://phabricator.wikimedia.org/T275240 (dr0ptp4kt) Yes, it works. Thank you! [22:50:48] (CR) Bstorm: [C: +1] "1016 is primary over on" [puppet] - https://gerrit.wikimedia.org/r/668563 (owner: Phamhi) [22:51:19] (CR) Bstorm: [C: +1] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/668563 (owner: Phamhi) [22:57:59] (PS1) Andrew Bogott: labs_lvm: check for available space before partitioning [puppet] - https://gerrit.wikimedia.org/r/668567 (https://phabricator.wikimedia.org/T272114) [22:58:25] (CR) jerkins-bot: [V: -1] labs_lvm: check for available space before partitioning [puppet] - https://gerrit.wikimedia.org/r/668567 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott) [23:07:33] SRE, ops-eqiad, User-ArielGlenn: Interface errors on asw2-b-eqiad:ge-8/0/6 (dumpsdata1001) - https://phabricator.wikimedia.org/T273714 (wiki_willy) Looks like the errors have cleared up from the past week. (thanks for checking @Papaul) @ArielGlenn - you ok if we close this task out?
Thanks, Willy [23:12:14] (CR) Legoktm: mailman3: Add hyperkitty (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [23:12:16] (PS2) Andrew Bogott: labs_lvm: check for available space before partitioning [puppet] - https://gerrit.wikimedia.org/r/668567 (https://phabricator.wikimedia.org/T272114) [23:12:42] (CR) jerkins-bot: [V: -1] labs_lvm: check for available space before partitioning [puppet] - https://gerrit.wikimedia.org/r/668567 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott) [23:13:43] (PS3) Andrew Bogott: labs_lvm: check for available space before partitioning [puppet] - https://gerrit.wikimedia.org/r/668567 (https://phabricator.wikimedia.org/T272114) [23:23:49] (CR) Ladsgroup: mailman3: Add hyperkitty (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [23:29:54] (CR) Legoktm: mailman3: Add hyperkitty (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [23:30:40] (CR) Legoktm: [C: +2] mailman3: Add hyperkitty [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup) [23:39:10] !log legoktm@cumin1001 START - Cookbook sre.ganeti.makevm for new host registry1004.eqiad.wmnet [23:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:00] SRE, Wikimedia-Mailing-lists, Patch-For-Review, User-Ladsgroup: Puppetize mailman3 web and hyperkitty (mailman archiver) - https://phabricator.wikimedia.org/T256542 (Ladsgroup) Open→Resolved a:Ladsgroup [23:40:03] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (Ladsgroup) [23:42:23] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 -
https://phabricator.wikimedia.org/T256536 (Ladsgroup) What's left: MTA integration. For production-ready puppet (TLS termination and acme, logging, monitoring, SpamAssassin, etc.). I'll make a separate ticket. [23:55:36] !log legoktm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host registry1004.eqiad.wmnet [23:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:45] (PS1) Legoktm: install_server: Add registry1004.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/668570 (https://phabricator.wikimedia.org/T276380)