[00:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T0000) [00:00:04] Jdlrobson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:08] o/ present [00:00:27] i can deploy today [00:00:43] (03PS1) 10Razzi: motd: Use heredoc to allow expanding description with apostrophe [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) [00:01:02] (03CR) 10Urbanecm: [C: 03+2] Logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669998 (https://phabricator.wikimedia.org/T273085) (owner: 10Jdlrobson) [00:01:15] (03CR) 10Urbanecm: [C: 03+2] Enable modern Vector on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479) (owner: 10Jdlrobson) [00:01:59] (03Merged) 10jenkins-bot: Logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669998 (https://phabricator.wikimedia.org/T273085) (owner: 10Jdlrobson) [00:02:03] (03Merged) 10jenkins-bot: Enable modern Vector on incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669999 (https://phabricator.wikimedia.org/T275479) (owner: 10Jdlrobson) [00:03:16] Jdlrobson: both of them are on mwdebug1001, let me know if you want to test them separately [00:03:40] Urbanecm: testing together is fine [00:03:41] and on it [00:03:53] thanks [00:05:29] Urbanecm: perfect [00:05:32] sync away! 
[00:05:58] cool [00:06:01] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28455/console" [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [00:07:17] (03CR) 10Razzi: "Hi @Bstorm, here's a fun bug fix related to the role::wmcs::db::wikireplicas::dedicated::analytics_multiinstance description having an apo" [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [00:08:26] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: ce82e0cd6015b362f9acc8e90d300cf88738cf98: Logo updates (T273085) (duration: 00m 58s) [00:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:34] T273085: Deploy more logos for new DIP pilot wikis - https://phabricator.wikimedia.org/T273085 [00:09:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ce82e0cd6015b362f9acc8e90d300cf88738cf98: Logo updates (T273085) (duration: 00m 58s) [00:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:42] !log urbanecm@deploy1002 Synchronized dblists/desktop-improvements.dblist: 0d260eda2d62ae053310ee978201b1a031522d59: Enable modern Vector on incubator (T275479; 1/2) (duration: 01m 01s) [00:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:49] T275479: Deploy modern vector to wikimedia incubator - https://phabricator.wikimedia.org/T275479 [00:11:55] thanks Urbanecm :) [00:12:05] any time [00:12:08] anything else? 
[00:12:59] !log urbanecm@deploy1002 Synchronized wmf-config/config/incubatorwiki.yaml: 0d260eda2d62ae053310ee978201b1a031522d59: Enable modern Vector on incubator (T275479; 2/2) (duration: 00m 57s) [00:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:03] (03CR) 10Andrew Bogott: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [00:20:03] Urbanecm: Nope that's it [00:20:07] cool [00:20:23] (sorry i was just passing it to designer) [00:22:44] (03CR) 10Bstorm: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [00:23:26] np [00:24:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:41:36] (03CR) 10Andrew Bogott: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [00:50:36] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:58:28] !log krinkle@mwmaint1002 Ran invalidateUserSesssions.php for one user [00:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:18] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [01:18:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:49:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:52:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:07:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.34 [core] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670019 [02:09:01] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.34 [core] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670019 (https://phabricator.wikimedia.org/T274938) (owner: 10TrainBranchBot) [05:16:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for table check T276742', diff saved to https://phabricator.wikimedia.org/P14675 and previous config saved to /var/cache/conftool/dbconfig/20210309-051646-marostegui.json [05:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:55] T276742: Check for errors on all tables on some hosts - https://phabricator.wikimedia.org/T276742 [05:33:57] (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1014:3312, clouddb1014:3317 [puppet] - 10https://gerrit.wikimedia.org/r/670030 (https://phabricator.wikimedia.org/T269211) [05:35:54] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool clouddb1014:3312, clouddb1014:3317 [puppet] - 10https://gerrit.wikimedia.org/r/670030 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [05:37:41] !log Stop mysql on clouddb1014:3312, 3317 to transfer its data to clouddb1021 [05:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:56] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [05:47:10] ^expected [06:01:35] 10SRE, 10Patch-For-Review: Role with quote in description causes bash syntax error - https://phabricator.wikimedia.org/T276868 (10Aklapper) [06:02:47] (03PS1) 10Marostegui: install_server: Do not reimage clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/670034 (https://phabricator.wikimedia.org/T269211) 
[06:19:01] !log Deploy schema change on s6 codfw (there will be lag on codfw) T276150 T276156 [06:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:11] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [06:19:12] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156 [06:19:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:22:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:26:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=gerrit-metrics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:27:17] <_joe_> see ^^ [06:29:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:30:11] <_joe_> !log restarting gerrit on gerrit1001, using 48 GB of heap [06:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:30] <_joe_> it will take some time, but what better time than now? 
[06:33:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] systemd::timer::job: correctly quote environment variables [puppet] - 10https://gerrit.wikimedia.org/r/669753 (owner: 10Giuseppe Lavagetto) [06:39:50] (03Abandoned) 10Giuseppe Lavagetto: mediawiki::web: fix position of template [puppet] - 10https://gerrit.wikimedia.org/r/513265 (owner: 10Giuseppe Lavagetto) [06:40:59] (03Abandoned) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/355417 (owner: 10Giuseppe Lavagetto) [06:43:06] (03PS1) 10Marostegui: clouddb1021: Change buffer pool sizes [puppet] - 10https://gerrit.wikimedia.org/r/670046 (https://phabricator.wikimedia.org/T269211) [06:43:46] (03Abandoned) 10Giuseppe Lavagetto: profile::discovery::client: fix services file [puppet] - 10https://gerrit.wikimedia.org/r/545526 (owner: 10Giuseppe Lavagetto) [06:44:02] (03CR) 10Marostegui: [C: 03+2] clouddb1021: Change buffer pool sizes [puppet] - 10https://gerrit.wikimedia.org/r/670046 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [06:44:34] (03PS2) 10Giuseppe Lavagetto: wmflib: fix the documentation of the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598462 [06:48:44] (03CR) 10Elukey: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/670034 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [06:49:28] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/670034 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [06:50:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wmflib: fix the documentation of the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598462 (owner: 10Giuseppe Lavagetto) [06:52:52] (03PS2) 10Giuseppe Lavagetto: wmflib: remove deprecated $::_roles variable from the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598463 [06:58:17] (03PS3) 10Giuseppe Lavagetto: wmflib: remove deprecated 
$::_roles variable from the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598463 [06:58:25] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [06:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:11] !log drain + reimage an-worker109[4,5] to Buster [07:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:04] (03PS4) 10Giuseppe Lavagetto: wmflib: remove deprecated $::_roles variable from the role() function [puppet] - 10https://gerrit.wikimedia.org/r/598463 [07:08:16] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28459/console" [puppet] - 10https://gerrit.wikimedia.org/r/598463 (owner: 10Giuseppe Lavagetto) [07:24:57] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435 [07:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:13] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [07:36:34] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1094.eqiad.wmnet with reason: REIMAGE [07:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:23] (03PS1) 10Filippo Giunchedi: Decom ms-be[2016-2027] [puppet] - 10https://gerrit.wikimedia.org/r/670086 (https://phabricator.wikimedia.org/T272837) [07:38:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1095.eqiad.wmnet with reason: REIMAGE [07:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1094.eqiad.wmnet with reason: REIMAGE [07:38:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:14] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom ms-be[2016-2027] [puppet] - 10https://gerrit.wikimedia.org/r/670086 (https://phabricator.wikimedia.org/T272837) (owner: 10Filippo Giunchedi) [07:40:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1095.eqiad.wmnet with reason: REIMAGE [07:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:55] (03PS1) 10Marostegui: clouddb1021: Add s2 and s7 to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/670088 (https://phabricator.wikimedia.org/T269211) [07:42:25] _joe_: merged your change too [07:44:04] <_joe_> godog: uhhh the doc change for role? yes [07:44:06] <_joe_> sorry :/ [07:44:38] <_joe_> I was running a full catalog compile on the subsequent patch and it's taking longer than expected [07:44:56] _joe_: yeah the doc one, no problem [07:47:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:48:28] (03PS2) 10Muehlenhoff: Add profile::base::linux510 for cloudgw and cloudnet [puppet] - 10https://gerrit.wikimedia.org/r/668087 [07:49:53] (03CR) 10Elukey: [C: 03+1] clouddb1021: Add s2 and s7 to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/670088 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [07:50:16] (03CR) 10Marostegui: [C: 03+2] clouddb1021: Add s2 and s7 to clouddb1021 [puppet] - 10https://gerrit.wikimedia.org/r/670088 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [07:50:27] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.098 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:50:41] RECOVERY - Prometheus jobs reduced availability on 
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:50:41] !log restarted blazegraph on wdqs1004 and depooled it to catchup lag [07:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:05] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 751 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:57:39] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [07:59:32] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be2016.codfw.wmnet [07:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:37] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "Ran a full catalog compilation, nothing explodes:https://puppet-compiler.wmflabs.org/compiler1002/28460/" [puppet] - 10https://gerrit.wikimedia.org/r/598463 (owner: 10Giuseppe Lavagetto) [08:04:49] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1014:3312, clouddb1014:3317" [puppet] - 10https://gerrit.wikimedia.org/r/670106 [08:04:58] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) @herron getting back to this so we can add an OKR for Q4 :) We could do the follo... 
[08:06:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not tested but LGTM" [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/669894 (https://phabricator.wikimedia.org/T276595) (owner: 10Cwhite) [08:06:18] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1014:3312, clouddb1014:3317" [puppet] - 10https://gerrit.wikimedia.org/r/670106 (owner: 10Marostegui) [08:09:28] (03Abandoned) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741) (owner: 10Giuseppe Lavagetto) [08:09:51] (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1015:3314, clouddb1015:3316 [puppet] - 10https://gerrit.wikimedia.org/r/670092 (https://phabricator.wikimedia.org/T269211) [08:10:35] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool clouddb1015:3314, clouddb1015:3316 [puppet] - 10https://gerrit.wikimedia.org/r/670092 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [08:10:37] (03Abandoned) 10Giuseppe Lavagetto: scb: add service proxy, use it in the applications. 
[puppet] - 10https://gerrit.wikimedia.org/r/612461 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:11:19] (03Abandoned) 10Giuseppe Lavagetto: Remove user/group settings from php-fpm's configuration [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/651450 (owner: 10Giuseppe Lavagetto) [08:12:16] !log Stop mysql on clouddb1015:3314, 3316 [08:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:49] (03CR) 10Giuseppe Lavagetto: "recheck" [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639021 (owner: 10Giuseppe Lavagetto) [08:17:47] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10MoritzMuehlenhoff) [08:17:56] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, thanks for taking care of this !" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/669968 (owner: 10Cwhite) [08:18:39] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [08:25:24] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Update changelog [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/639021 (owner: 10Giuseppe Lavagetto) [08:25:35] ^expected [08:26:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be2016.codfw.wmnet [08:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [08:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [08:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:36:23] (03PS1) 10ArielGlenn: commons entity dumps: use mediainfo for the name in the config file [puppet] - 10https://gerrit.wikimedia.org/r/670095 [08:38:12] (03CR) 10ArielGlenn: [C: 03+2] commons entity dumps: use mediainfo for the name in the config file [puppet] - 10https://gerrit.wikimedia.org/r/670095 (owner: 10ArielGlenn) [08:46:36] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:52] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:58] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[2017-2019].codfw.wmnet [08:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:52] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1092.eqiad.wmnet with reason: REIMAGE [08:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1093.eqiad.wmnet with reason: REIMAGE [08:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1092.eqiad.wmnet with reason: REIMAGE [08:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1093.eqiad.wmnet with reason: REIMAGE [08:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:45] (03PS1) 10JMeybohm: admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 [08:59:35] !log filippo@cumin1001 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2017-2019].codfw.wmnet [08:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:00:12] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:01:06] !log installing Linux 4.9.258 updates on Stretch hosts [09:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:09:34] (03PS1) 10Kosta Harlan: Homepage: Check welcome notice seen preference [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670107 (https://phabricator.wikimedia.org/T272754) [09:10:07] (03PS1) 10Marostegui: clouddb1021: Add s4 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/670100 (https://phabricator.wikimedia.org/T269211) [09:13:57] (03CR) 10Marostegui: [C: 03+1] wikireplica: depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/668803 (owner: 10Phamhi) [09:14:45] !log drain + reimage analytics1076 and an-worker1112 to Buster [09:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:27:21] 
RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:31:00] (03PS2) 10Urbanecm: Homepage: Check welcome notice seen preference [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670107 (https://phabricator.wikimedia.org/T272754) (owner: 10Kosta Harlan) [09:31:48] (03CR) 10Urbanecm: [C: 03+2] "branch not yet pulled to deployment host, merging to assure deployment" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670107 (https://phabricator.wikimedia.org/T272754) (owner: 10Kosta Harlan) [09:32:17] (03PS1) 10Urbanecm: Make help panel fallback to help desk if no mentor is available [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670108 (https://phabricator.wikimedia.org/T275908) [09:32:25] (03CR) 10Urbanecm: [C: 03+2] "branch not yet pulled to deployment host, merging to assure deployment" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670108 (https://phabricator.wikimedia.org/T275908) (owner: 10Urbanecm) [09:33:04] (03CR) 10Elukey: [C: 03+1] clouddb1021: Add s4 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/670100 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [09:36:54] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1076.eqiad.wmnet with reason: REIMAGE [09:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:35] (03PS1) 10Urbanecm: Make help panel fallback to help desk if no mentor is available [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670109 (https://phabricator.wikimedia.org/T275908) [09:39:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
analytics1076.eqiad.wmnet with reason: REIMAGE [09:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:25] (03CR) 10JMeybohm: [C: 04-1] modules/roles: Add k8s config for ML team machines (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [09:40:51] (03Abandoned) 10JMeybohm: Demo change on how to support switchable second staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/661961 (https://phabricator.wikimedia.org/T269835) (owner: 10JMeybohm) [09:40:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1112.eqiad.wmnet with reason: REIMAGE [09:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:58] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) >>! In T274461#6895107, @MoritzMuehlenhoff wrote: > @Sergey.Trofimovsky.SF : You can now log into idp01.ss... 
[09:42:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1112.eqiad.wmnet with reason: REIMAGE [09:43:00] (03CR) 10Elukey: modules/roles: Add k8s config for ML team machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [09:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:00] !log Reboot db2071 for kernel upgrade (stretch) [09:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:27] (03Merged) 10jenkins-bot: Homepage: Check welcome notice seen preference [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670107 (https://phabricator.wikimedia.org/T272754) (owner: 10Kosta Harlan) [09:44:37] !log Reboot db20712for kernel upgrade (stretch) [09:44:39] gah [09:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:44] !log Reboot db2072 for kernel upgrade (stretch) [09:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:28] (03Merged) 10jenkins-bot: Make help panel fallback to help desk if no mentor is available [extensions/GrowthExperiments] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670108 (https://phabricator.wikimedia.org/T275908) (owner: 10Urbanecm) [09:46:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host puppetboard2001.codfw.wmnet [09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "ping me if you want me to merge this one. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/668087 (owner: 10Muehlenhoff) [09:49:27] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[09:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:38] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2001.codfw.wmnet [09:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:48] !log Reboot db2073 for kernel upgrade (stretch) [09:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host puppetboard1001.eqiad.wmnet [09:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "page created! https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastion.wmcloud.org" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/668746 (owner: 10Elukey) [09:54:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1001.eqiad.wmnet [09:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:23] PROBLEM - Host db2073 is DOWN: PING CRITICAL - Packet loss = 100% [09:57:13] ^ me [10:00:31] !log installing busybox security updates [10:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:22] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts ms-be[2020-2027].codfw.wmnet [10:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:32] (03PS7) 10Kormat: mariadb: Use section params: remaining profiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/669845 [10:08:57] (03CR) 10JMeybohm: [C: 04-1] modules/roles: Add k8s config for ML team machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [10:10:58] (03PS14) 10Klausman: modules/roles: Add k8s config for ML team machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [10:11:02] (03CR) 10Klausman: modules/roles: Add k8s config for ML team machines (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [10:11:30] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 7 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28461/console" [puppet] - 10https://gerrit.wikimedia.org/r/669845 (owner: 10Kormat) [10:14:07] !log installing libbsd security updates [10:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:35] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [10:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:13] 10SRE, 10MW-on-K8s, 10serviceops: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10Joe) [10:17:32] (03PS8) 10Kormat: mariadb: Use section params: remaining profiles. [puppet] - 10https://gerrit.wikimedia.org/r/669845 (https://phabricator.wikimedia.org/T275497) [10:22:34] 10SRE, 10MW-on-K8s, 10serviceops: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10Joe) Ideally, the liveness probe needs to check if the container is running (more or less), while the readiness probe should check that the service is still responding. What... 
[10:23:05] !log installing gdisk security updates [10:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:19] (03PS1) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 [10:23:26] 10SRE, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) When doing a reboot for T273280 I just ran into this issue with db2073. db2072 (which has a newer firmware) rebooted just fine. [10:23:55] (03CR) 10jerkins-bot: [V: 04-1] openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 (owner: 10Arturo Borrero Gonzalez) [10:25:02] 10ops-codfw, 10DBA: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui) [10:25:21] 10ops-codfw, 10DBA: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Marostegui) p:05Triage→03Medium [10:30:01] (03PS1) 10Marostegui: db2073,db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670130 (https://phabricator.wikimedia.org/T276909) [10:30:59] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [10:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:52] !log upgrading perf on stretch hosts [10:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:08] (03CR) 10Marostegui: [C: 03+2] db2073,db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/670130 (https://phabricator.wikimedia.org/T276909) (owner: 10Marostegui) [10:35:05] I'm rebooting prometheus hosts in codfw/eqiad, there will be some expected alerts [10:35:20] (03PS2) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 [10:36:35] !log filippo@cumin1001 
START - Cookbook sre.hosts.reboot-single for host prometheus2003.codfw.wmnet [10:36:36] (03PS3) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 [10:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:30] (03PS4) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 [10:43:06] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2020-2027].codfw.wmnet [10:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:32] 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10fgiunchedi) [10:44:34] 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10fgiunchedi) a:05fgiunchedi→03Papaul @papaul all yours [10:45:55] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [10:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:07] (03CR) 10Marostegui: [C: 03+2] clouddb1021: Add s4 and s6 [puppet] - 10https://gerrit.wikimedia.org/r/670100 (https://phabricator.wikimedia.org/T269211) (owner: 10Marostegui) [10:47:06] (03PS1) 10David Caro: wmcs.backups: Skip VMs in build status [puppet] - 10https://gerrit.wikimedia.org/r/670133 (https://phabricator.wikimedia.org/T276910) [10:47:17] jouncebot: next [10:47:17] In 1 hour(s) and 12 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T1200) [10:49:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:23] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2003.codfw.wmnet [10:52:27] (03PS2) 10DCausse: 
Add a note for the elasticsearch image in releng/dev-images [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/669720 [10:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:40] (03CR) 10DCausse: Add a note for the elasticsearch image in releng/dev-images (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/669720 (owner: 10DCausse) [10:53:05] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/670134 (owner: 10Jakob) [10:53:36] !log started to import lexemes on wdqs1009 (T276784) [10:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:42] T276784: Recover lexemes on wdqs1009 - https://phabricator.wikimedia.org/T276784 [10:54:18] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1037.eqiad.wmnet [10:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:36] (03PS5) 10Arturo Borrero Gonzalez: openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 [10:55:06] (03CR) 10JMeybohm: modules/roles: Add k8s config for ML team machines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [10:56:20] !log installing mariadb-10.1 updates for stretch (distro version with libs/tools only, not wmf-mariadb) [10:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:00:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:01:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1037.eqiad.wmnet [11:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:33] (03PS1) 10Jbond: P:pki::client: make the mutual tls certificates optional [puppet] - 10https://gerrit.wikimedia.org/r/670137 [11:04:58] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host webperf2001.codfw.wmnet [11:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:08] (03CR) 10Jbond: [C: 03+2] P:pki::client: make the mutual tls certificates optional [puppet] - 10https://gerrit.wikimedia.org/r/670137 (owner: 10Jbond) [11:05:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: wmcs-drain-hypervisor: detect bogus migration [puppet] - 10https://gerrit.wikimedia.org/r/670129 (owner: 10Arturo Borrero Gonzalez) [11:06:04] (03CR) 10JMeybohm: [C: 04-2] "This needs to also allow cross cluster traffic, at least for prod-prod and staging-staging." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 (owner: 10JMeybohm) [11:07:41] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.16.98:7001 on restbase2019 is CRITICAL: SSL CRITICAL - Certificate restbase2019-a valid until 2021-04-07 15:35:49 +0000 (expires in 29 days) Hnowlan Replacing certs 2021-03-10 https://phabricator.wikimedia.org/T120662 [11:07:41] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.16.99:7001 on restbase2019 is CRITICAL: SSL CRITICAL - Certificate restbase2019-b valid until 2021-04-07 15:35:50 +0000 (expires in 29 days) Hnowlan Replacing certs 2021-03-10 https://phabricator.wikimedia.org/T120662 [11:07:41] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.16.100:7001 on restbase2019 is CRITICAL: SSL CRITICAL - Certificate restbase2019-c valid until 2021-04-07 15:35:51 +0000 (expires in 29 days) Hnowlan Replacing certs 2021-03-10 https://phabricator.wikimedia.org/T120662 [11:07:41] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is CRITICAL: SSL CRITICAL - Certificate restbase2020-a valid until 2021-04-07 15:35:52 +0000 (expires in 29 days) Hnowlan Replacing certs 2021-03-10 https://phabricator.wikimedia.org/T120662 [11:07:41] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.32.120:7001 on restbase2020 is CRITICAL: SSL CRITICAL - Certificate restbase2020-b valid until 2021-04-07 15:35:54 +0000 (expires in 29 days) Hnowlan Replacing certs 2021-03-10 https://phabricator.wikimedia.org/T120662 [11:07:41] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.32.121:7001 on restbase2020 is CRITICAL: SSL CRITICAL - Certificate restbase2020-c valid until 2021-04-07 15:35:55 +0000 (expires in 29 days) Hnowlan Replacing certs 2021-03-10 https://phabricator.wikimedia.org/T120662 [11:09:05] (03PS1) 10MSantos: wikifeeds: bump to 2021-03-08-191840-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670138 [11:09:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2001.codfw.wmnet [11:09:46] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host webperf2002.codfw.wmnet [11:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:17] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1348 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:11:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2002.codfw.wmnet [11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host webperf1002.eqiad.wmnet [11:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:15] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:15:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1002.eqiad.wmnet [11:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-03-08-191840-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670138 (owner: 10MSantos) [11:16:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:17:13] (03PS1) 10Marostegui: Revert "dbproxy1019: Depool clouddb1015:3314, clouddb1015:3316" [puppet] - 10https://gerrit.wikimedia.org/r/670112 [11:17:33] (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-03-08-191840-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670138 (owner: 10MSantos) [11:18:16] (03CR) 10David Caro: [C: 03+1] "LGTM, just curious, do we know when/how that 
might happen?" [puppet] - 10https://gerrit.wikimedia.org/r/670129 (owner: 10Arturo Borrero Gonzalez) [11:18:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host webperf1001.eqiad.wmnet [11:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:51] (03PS1) 10MSantos: proton: bump to 2021-03-09-110455-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670140 [11:19:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:20:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1001.eqiad.wmnet [11:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:31] (03CR) 10MSantos: [C: 03+2] proton: bump to 2021-03-09-110455-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670140 (owner: 10MSantos) [11:22:40] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1019: Depool clouddb1015:3314, clouddb1015:3316" [puppet] - 10https://gerrit.wikimedia.org/r/670112 (owner: 10Marostegui) [11:22:42] (03Merged) 10jenkins-bot: proton: bump to 2021-03-09-110455-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670140 (owner: 10MSantos) [11:22:49] RECOVERY - puppet last run on maps1009 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:23:17] (03PS15) 10Klausman: modules/roles: Add k8s config for ML team machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [11:23:23] (03CR) 10Klausman: modules/roles: Add k8s config for ML team machines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [11:24:23] PROBLEM - 
High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:25:28] (03PS1) 10MSantos: mobileapps: bump to 2021-03-08-191346-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670141 [11:25:35] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2004.codfw.wmnet [11:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:40] increased latency since around 11:18 [11:26:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host doc1001.eqiad.wmnet [11:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:53] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 5.38e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [11:27:25] there was a reboot then, maybe only a measuring artifact? 
[11:27:27] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-03-08-191346-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670141 (owner: 10MSantos) [11:28:21] no, it is traffic-driven, there is an increase in requests at the same time [11:28:25] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-03-08-191346-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670141 (owner: 10MSantos) [11:28:44] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=26&from=1615278520599&orgId=1&to=1615289320599&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 [11:28:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:28:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc1001.eqiad.wmnet [11:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:35] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:29:38] (03CR) 10Kosta Harlan: Add a note for the elasticsearch image in releng/dev-images (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/669720 (owner: 10DCausse) [11:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:54] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mwdebug1003.eqiad.wmnet [11:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mw1307.eqiad.wmnet [11:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:34] (03PS1) 10MSantos: push-notificatins: bump to 2021-03-09-111134-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/670148 [11:41:54] (03PS2) 10MSantos: push-notifications: bump to 2021-03-09-111134-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/670148 [11:42:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1003.eqiad.wmnet [11:42:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2004.codfw.wmnet [11:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:05] (03CR) 10MSantos: [C: 03+2] push-notifications: bump to 2021-03-09-111134-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/670148 (owner: 10MSantos) [11:43:42] (03Merged) 10jenkins-bot: push-notifications: bump to 2021-03-09-111134-publish [deployment-charts] - 10https://gerrit.wikimedia.org/r/670148 (owner: 10MSantos) [11:45:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1307.eqiad.wmnet [11:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:33] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:52:21] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:38] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:59] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE [11:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE [11:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:44] !log restart envoy on mw1276 [11:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:52] (03CR) 10Urbanecm: [C: 03+2] Make help panel fallback to help desk if no mentor is available [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670109 (https://phabricator.wikimedia.org/T275908) (owner: 10Urbanecm) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:00:11] I'll deploy the backport I just +2'ed [12:00:22] !log Upgrade db2076 kernel [12:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: enable ipv6 on envoy services mw canaries [puppet] - 10https://gerrit.wikimedia.org/r/669878 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [12:03:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173 for schema change', diff saved to https://phabricator.wikimedia.org/P14682 and previous config saved to /var/cache/conftool/dbconfig/20210309-120326-marostegui.json [12:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:04] !log Upgrade db2077 kernel [12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:38] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 5: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/670129 (owner: 10Arturo Borrero Gonzalez) [12:12:45] (03Merged) 10jenkins-bot: Make help panel fallback to help desk if no mentor is available [extensions/GrowthExperiments] (wmf/1.36.0-wmf.33) - 10https://gerrit.wikimedia.org/r/670109 (https://phabricator.wikimedia.org/T275908) (owner: 10Urbanecm) [12:13:33] !log Upgrade db2080 kernel [12:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:16] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10jijiki) @Cyberpower678 here is a sample: https://grafana.wikimedia.org/d/5E7tdiGWz/xxxx-effie?viewPanel=38&orgId=1&from=16... 
[12:16:16] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.33/extensions/GrowthExperiments/: dbd6f0cb299bcfb6648b351e1476100fe669cc58: Make help panel fallback to help desk if no mentor is available (T275908; T273782) (duration: 01m 01s) [12:16:22] * Urbanecm done [12:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:24] T275908: Scale: default to mentorship questions from help panel - https://phabricator.wikimedia.org/T275908 [12:16:24] T273782: Help desk errors out when GEHelpPanelAskMentor is true and there are no mentors available - https://phabricator.wikimedia.org/T273782 [12:18:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14683 and previous config saved to /var/cache/conftool/dbconfig/20210309-121849-root.json [12:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1166 entirely', diff saved to https://phabricator.wikimedia.org/P14684 and previous config saved to /var/cache/conftool/dbconfig/20210309-121913-marostegui.json [12:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14685 and previous config saved to /var/cache/conftool/dbconfig/20210309-121924-root.json [12:19:26] (03CR) 10JMeybohm: modules/roles: Add k8s config for ML team machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [12:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:42] (03CR) 10Klausman: modules/roles: Add k8s config for ML team machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) 
[12:26:05] !log Upgrade db2094 kernel [12:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:05] !log Upgrade db2084 kernel [12:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:36] !log regenerating interfaces and reimaging aqs101[1-5] [12:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:16] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [12:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:26] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1038.eqiad.wmnet [12:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14686 and previous config saved to /var/cache/conftool/dbconfig/20210309-123427-root.json [12:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:18] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:31] (03CR) 10Muehlenhoff: [C: 04-1] "Started a mail thread for discussion, -1 until we have an outcome" [puppet] - 10https://gerrit.wikimedia.org/r/668397 (owner: 10Jbond) [12:41:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mw1402.eqiad.wmnet [12:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mw1403.eqiad.wmnet [12:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:22] (03PS16) 10Klausman: modules/roles: Add k8s config for ML team machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [12:48:17] 10SRE, 10Desktop 
Improvements, 10Traffic, 10Bengali-Sites, and 4 others: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10ovasileva) 05Open→03Resolved a:03ovasileva >>! In T274784#6893902, @BBlack wrote: > ^ There was a la... [12:49:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14687 and previous config saved to /var/cache/conftool/dbconfig/20210309-124931-root.json [12:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 for schema change', diff saved to https://phabricator.wikimedia.org/P14688 and previous config saved to /var/cache/conftool/dbconfig/20210309-125007-marostegui.json [12:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1402.eqiad.wmnet [12:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw1403.eqiad.wmnet [12:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE [12:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:46] (03PS3) 10DCausse: Add a note for the elasticsearch image in releng/dev-images [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/669720 [12:58:07] (03CR) 10DCausse: Add a note for the elasticsearch image in releng/dev-images (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/669720 (owner: 10DCausse) [12:59:32] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE [12:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:56] !log drain + reimage an-worker1103 to Buster [13:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14689 and previous config saved to /var/cache/conftool/dbconfig/20210309-130116-root.json [13:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:27] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE [13:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:37] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE [13:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1103.eqiad.wmnet with reason: REIMAGE [13:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1103.eqiad.wmnet with reason: REIMAGE [13:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:52] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1003.eqiad.wmnet [13:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:45] (03PS2) 10JMeybohm: admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 [13:13:18] (03PS1) 10Klausman: ml-k8s: Add dummy controllermanager_tokens [labs/private] - 10https://gerrit.wikimedia.org/r/670165 
[13:13:47] (03PS3) 10JMeybohm: admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 [13:15:26] (03PS2) 10Klausman: ml-k8s: Add dummy controllermanager_tokens [labs/private] - 10https://gerrit.wikimedia.org/r/670165 [13:16:07] (03CR) 10Klausman: [C: 03+2] ml-k8s: Add dummy controllermanager_tokens [labs/private] - 10https://gerrit.wikimedia.org/r/670165 (owner: 10Klausman) [13:16:15] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-k8s: Add dummy controllermanager_tokens [labs/private] - 10https://gerrit.wikimedia.org/r/670165 (owner: 10Klausman) [13:16:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14690 and previous config saved to /var/cache/conftool/dbconfig/20210309-131620-root.json [13:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1198:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14691 and previous config saved to /var/cache/conftool/dbconfig/20210309-131652-marostegui.json [13:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:40] (03PS1) 10JMeybohm: Move k8s controllermanager_token to common staging [labs/private] - 10https://gerrit.wikimedia.org/r/670167 (https://phabricator.wikimedia.org/T276305) [13:18:23] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Move k8s controllermanager_token to common staging [labs/private] - 10https://gerrit.wikimedia.org/r/670167 (https://phabricator.wikimedia.org/T276305) (owner: 10JMeybohm) [13:27:01] !log reimage an-worker1102 and an-worker1080 (hdfs journal node) to Buster [13:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:05] (03CR) 10Muehlenhoff: [C: 03+1] "Good catch, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/668657 (owner: 10Jcrespo) [13:28:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1003.eqiad.wmnet [13:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:13] (03CR) 10Muehlenhoff: [C: 03+2] base: Fix bug introduced on check microcode by 7a99982 & 4a436bd [puppet] - 10https://gerrit.wikimedia.org/r/668657 (owner: 10Jcrespo) [13:30:55] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:31:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14692 and previous config saved to /var/cache/conftool/dbconfig/20210309-133124-root.json [13:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:15] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:34:29] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudvirt1038.eqiad.wmnet with reason: HW issue [13:34:29] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudvirt1038.eqiad.wmnet with reason: HW issue [13:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:05] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1004.eqiad.wmnet [13:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:27] (03PS1) 10Elukey: Add fake secrets for ml-ctrl.svc endpoints [labs/private] - 
10https://gerrit.wikimedia.org/r/670169 [13:40:21] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10aborrero) p:05Triage→03High a:03RobH Hey @RobH does this sound like something we have seen before in our fleet? [13:40:36] (03CR) 10Klausman: [C: 03+1] Add fake secrets for ml-ctrl.svc endpoints [labs/private] - 10https://gerrit.wikimedia.org/r/670169 (owner: 10Elukey) [13:40:39] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10aborrero) [13:40:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake secrets for ml-ctrl.svc endpoints [labs/private] - 10https://gerrit.wikimedia.org/r/670169 (owner: 10Elukey) [13:43:19] (03PS17) 10Klausman: modules/roles: Add k8s config for ML team machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) [13:44:09] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28470/console" [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [13:44:28] (03PS1) 10Muehlenhoff: Simplify microcode check [puppet] - 10https://gerrit.wikimedia.org/r/670171 [13:45:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14693 and previous config saved to /var/cache/conftool/dbconfig/20210309-134522-root.json [13:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:17] (03CR) 10jerkins-bot: [V: 04-1] Simplify microcode check [puppet] - 10https://gerrit.wikimedia.org/r/670171 (owner: 10Muehlenhoff) [13:48:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1080.eqiad.wmnet with reason: REIMAGE [13:48:38] !log elukey@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on an-worker1102.eqiad.wmnet with reason: REIMAGE [13:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1080.eqiad.wmnet with reason: REIMAGE [13:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1102.eqiad.wmnet with reason: REIMAGE [13:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1004.eqiad.wmnet [13:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:29] (03PS2) 10Muehlenhoff: Simplify microcode check [puppet] - 10https://gerrit.wikimedia.org/r/670171 [13:57:30] (03PS1) 10Jcrespo: dbbackups: Limit concurrency of es backups on codfw, too [puppet] - 10https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) [13:58:02] (03PS2) 10Jcrespo: dbbackups: Limit concurrency of es backups on codfw, too [puppet] - 10https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) [13:58:33] (03CR) 10Jcrespo: "I just realized this templates are under the mariadb hierarchy. I will move them under dbbackups for consistency on a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:58:40] (03CR) 10Marostegui: "just curious, what's the default?" 
[puppet] - 10https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [14:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 30%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14694 and previous config saved to /var/cache/conftool/dbconfig/20210309-140025-root.json [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:55] (03CR) 10Klausman: [V: 03+1 C: 03+2] modules/roles: Add k8s config for ML team machines [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [14:01:50] (03CR) 10Klausman: [V: 03+1 C: 03+2] modules/roles: Add k8s config for ML team machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/668075 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [14:02:35] (03CR) 10Noa wmde: [C: 03+1] Update termbox to 2021-03-01-112916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670134 (owner: 10Jakob) [14:03:59] (03CR) 10Jakob: [C: 03+2] Update termbox to 2021-03-01-112916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670134 (owner: 10Jakob) [14:04:45] (03Merged) 10jenkins-bot: Update termbox to 2021-03-01-112916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/670134 (owner: 10Jakob) [14:08:33] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[14:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2005.codfw.wmnet [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] !log installing intel-microcode updates on stretch [14:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:59] !log jakob@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [14:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2005.codfw.wmnet [14:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] !log jakob@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [14:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 60%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14696 and previous config saved to /var/cache/conftool/dbconfig/20210309-141529-root.json [14:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:17:26] !log jakob@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . [14:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:20:54] (03CR) 10Andrew Bogott: "We refresh the list of VMs on the hypervisor in each retry loop so if we miss a VM it'll get picked up on the next retry regardless." [puppet] - 10https://gerrit.wikimedia.org/r/670129 (owner: 10Arturo Borrero Gonzalez) [14:21:21] PROBLEM - Check systemd state on ml-serve-ctrl1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2006.codfw.wmnet [14:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:31] (03PS1) 10Ottomata: Finalize Editing team schema ingestion migration to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/670179 (https://phabricator.wikimedia.org/T267343) [14:25:56] (03PS1) 10Whym: Fix obsolete comments on wgCheckUserLogLogins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802) [14:25:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2006.codfw.wmnet [14:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE [14:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:44] !log 
jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2007.codfw.wmnet [14:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:50] (03CR) 10Ottomata: [C: 03+2] Finalize Editing team schema ingestion migration to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/670179 (https://phabricator.wikimedia.org/T267343) (owner: 10Ottomata) [14:28:18] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE [14:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE [14:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] !log drain + reimage an-worker1090/89 to Buster [14:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:17] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE [14:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P14697 and previous config saved to /var/cache/conftool/dbconfig/20210309-143033-root.json [14:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2007.codfw.wmnet [14:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:17] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE [14:32:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:46] !log volker-e@deploy1002 Started deploy [design/style-guide@deee49c]: Deploy design/style-guide: deee49c index: Add links to our design process and work guides (#446) [14:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] !log volker-e@deploy1002 Finished deploy [design/style-guide@deee49c]: Deploy design/style-guide: deee49c index: Add links to our design process and work guides (#446) (duration: 00m 06s) [14:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14698 and previous config saved to /var/cache/conftool/dbconfig/20210309-143453-marostegui.json [14:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:01] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.backups: Skip VMs in build status [puppet] - 10https://gerrit.wikimedia.org/r/670133 (https://phabricator.wikimedia.org/T276910) (owner: 10David Caro) [14:38:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe2008.codfw.wmnet [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:52] (03CR) 10Phamhi: [C: 03+2] wikireplica: depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/668803 (owner: 10Phamhi) [14:39:56] (03PS2) 10Phamhi: wikireplica: depool clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/668803 [14:41:07] (03CR) 10Andrew Bogott: [C: 03+1] "I don't at all remember why but I can confirm that on a new working VM this dir doesn't exist." 
[puppet] - 10https://gerrit.wikimedia.org/r/666668 (https://phabricator.wikimedia.org/T276501) (owner: 10Filippo Giunchedi) [14:41:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2008.codfw.wmnet [14:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:38] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10wkandek) gerrit.wikimedia.org lives on a second IP address on gerrit1001. Should we follow that model here as well? [14:46:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:46:43] (03PS1) 10Jbond: P:pki::multirootca: use apache for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/670185 [14:48:07] (03CR) 10jerkins-bot: [V: 04-1] P:pki::multirootca: use apache for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/670185 (owner: 10Jbond) [14:49:40] (03PS2) 10Jbond: P:pki::multirootca: use apache for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/670185 [14:50:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:51:31] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [14:52:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1090.eqiad.wmnet with reason: REIMAGE [14:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14699 and previous config saved to /var/cache/conftool/dbconfig/20210309-145205-root.json [14:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:09] phamhi: you working with clouddb1017? [14:53:20] marostegui yes [14:53:43] RECOVERY - Check systemd state on ml-serve-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:55] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1089.eqiad.wmnet with reason: REIMAGE [14:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1090.eqiad.wmnet with reason: REIMAGE [14:54:12] phamhi: Thanks, I saw the proxy alert and I wasn't sure :-). Would you mind doing a !log before the reboots? 
[14:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:25] forgot..will do ..thanks [14:55:57] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:56:07] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670174 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [14:56:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1089.eqiad.wmnet with reason: REIMAGE [14:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1005.eqiad.wmnet [14:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:20] PROBLEM - Check systemd state on ml-serve-ctrl1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:06] (03PS1) 10Phamhi: Revert "wikireplica: depool clouddb1017" [puppet] - 10https://gerrit.wikimedia.org/r/670113 [15:00:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1005.eqiad.wmnet [15:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:56] (03CR) 10Phamhi: [C: 03+2] Revert "wikireplica: depool clouddb1017" [puppet] - 10https://gerrit.wikimedia.org/r/670113 (owner: 10Phamhi) [15:05:22] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: stop symlinking puppet client ssl directory [puppet] - 10https://gerrit.wikimedia.org/r/666668 (https://phabricator.wikimedia.org/T276501) (owner: 10Filippo Giunchedi) [15:05:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 20%: 10', diff saved to https://phabricator.wikimedia.org/P14700 and previous config saved to /var/cache/conftool/dbconfig/20210309-150558-root.json [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 30%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14701 and previous config saved to /var/cache/conftool/dbconfig/20210309-150708-root.json [15:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:20] (03PS4) 10Mforns: WikimediaEvents: Bump session_tick sampling rate to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) (owner: 10Mholloway) [15:08:46] (03PS1) 10Phamhi: wikireplica: depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/670190 [15:10:34] (03CR) 10Bstorm: "> Patch Set 1: -Verified" [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [15:11:09] !log jmm@cumin2001 START - Cookbook 
sre.hosts.reboot-single for host ms-fe1006.eqiad.wmnet [15:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:31] (03PS1) 10Ottomata: Declare KaiOS / Inuka event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670191 (https://phabricator.wikimedia.org/T267344) [15:14:05] (03CR) 10Ottomata: [C: 03+2] WikimediaEvents: Bump session_tick sampling rate to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) (owner: 10Mholloway) [15:15:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1006.eqiad.wmnet [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:24] (03Merged) 10jenkins-bot: WikimediaEvents: Bump session_tick sampling rate to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668553 (https://phabricator.wikimedia.org/T276502) (owner: 10Mholloway) [15:15:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1007.eqiad.wmnet [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:39] (03PS1) 10Klausman: files/ssl/: Update ML k8s certs [puppet] - 10https://gerrit.wikimedia.org/r/670196 (https://phabricator.wikimedia.org/T272918) [15:16:59] (03PS1) 10Hnowlan: aqs: make aqs1010 a separate AQS cluster [puppet] - 10https://gerrit.wikimedia.org/r/670197 (https://phabricator.wikimedia.org/T257572) [15:17:40] (03CR) 10Klausman: [C: 03+2] files/ssl/: Update ML k8s certs [puppet] - 10https://gerrit.wikimedia.org/r/670196 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:18:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:18:50] !log reimage analytics1072 (hadoop hdfs journal node) to buster 
[15:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:01] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: use apache for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/670185 (owner: 10Jbond) [15:19:03] (03CR) 10Bstorm: [C: 03+1] motd: Use heredoc to allow expanding description with apostrophe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [15:19:43] (03CR) 10Bstorm: [C: 03+1] motd: Use heredoc to allow expanding description with apostrophe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [15:20:51] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Bump session_tick sampling rate to 10% (duration: 00m 58s) [15:20:53] (03PS2) 10Ottomata: Declare KaiOS / Inuka event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670191 (https://phabricator.wikimedia.org/T267344) [15:20:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14702 and previous config saved to /var/cache/conftool/dbconfig/20210309-152102-root.json [15:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 60%: Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14703 and previous config saved to /var/cache/conftool/dbconfig/20210309-152212-root.json [15:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:30] (03PS1) 10Klausman: manifest: Move ml-serve1002 insetup -> ml_k8s::master [puppet] - 10https://gerrit.wikimedia.org/r/670198 [15:25:46] (03PS2) 10Klausman: manifest: Move ml-serve1002 insetup -> ml_k8s::master [puppet] - 10https://gerrit.wikimedia.org/r/670198 (https://phabricator.wikimedia.org/T272918) [15:26:03] (03CR) 10Klausman: [C: 03+2] manifest: Move ml-serve1002 insetup -> ml_k8s::master [puppet] - 10https://gerrit.wikimedia.org/r/670198 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:26:06] (03CR) 10Klausman: [V: 03+2 C: 03+2] manifest: Move ml-serve1002 insetup -> ml_k8s::master [puppet] - 10https://gerrit.wikimedia.org/r/670198 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:26:17] (03CR) 10Ottomata: [C: 03+2] Declare KaiOS / Inuka event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670191 (https://phabricator.wikimedia.org/T267344) (owner: 10Ottomata) [15:27:23] (03PS1) 10Jbond: P:pki:multi: add class { 'sslcert::dhparam': } [puppet] - 10https://gerrit.wikimedia.org/r/670200 [15:27:59] !log 
otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Declare KaiOS / Inuka event streams - T267344 T267345 T267346 (duration: 00m 58s) [15:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:08] T267346: KaiOSAppFirstRun Event Platform Migration - https://phabricator.wikimedia.org/T267346 [15:28:08] T267345: KaiOSAppFeedback Event Platform Migration - https://phabricator.wikimedia.org/T267345 [15:28:09] T267344: InukaPageView Event Platform Migration - https://phabricator.wikimedia.org/T267344 [15:28:09] RECOVERY - Check systemd state on ml-serve-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1007.eqiad.wmnet [15:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:13] (03CR) 10Bstorm: "Just to confirm: what's clouddb? Is that the wikireplicas?" 
[puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [15:29:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ms-fe1008.eqiad.wmnet [15:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:05] (03CR) 10Bstorm: "I also don't know what that grant is for either way 😊" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [15:30:19] (03CR) 10Jbond: [C: 03+2] P:pki:multi: add class { 'sslcert::dhparam': } [puppet] - 10https://gerrit.wikimedia.org/r/670200 (owner: 10Jbond) [15:33:22] (03PS1) 10Andrew Bogott: labs_lvm: give a unique name to the pv-free exec [puppet] - 10https://gerrit.wikimedia.org/r/670207 [15:34:02] (03CR) 10jerkins-bot: [V: 04-1] labs_lvm: give a unique name to the pv-free exec [puppet] - 10https://gerrit.wikimedia.org/r/670207 (owner: 10Andrew Bogott) [15:34:40] (03PS1) 10Jbond: pki::multirootca: add headers module [puppet] - 10https://gerrit.wikimedia.org/r/670208 [15:35:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1008.eqiad.wmnet [15:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:44] (03PS2) 10Andrew Bogott: labs_lvm: give a unique name to the pv-free exec [puppet] - 10https://gerrit.wikimedia.org/r/670207 [15:36:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 40%: 10', diff saved to https://phabricator.wikimedia.org/P14704 and previous config saved to /var/cache/conftool/dbconfig/20210309-153605-root.json [15:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:15] (03CR) 10Andrew Bogott: [C: 03+2] labs_lvm: give a unique name to the pv-free exec [puppet] - 10https://gerrit.wikimedia.org/r/670207 (owner: 10Andrew Bogott) [15:37:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: 
Repooling db1096:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P14705 and previous config saved to /var/cache/conftool/dbconfig/20210309-153715-root.json [15:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:51] (03CR) 10Jbond: [C: 03+1] motd: Use heredoc to allow expanding description with apostrophe [puppet] - 10https://gerrit.wikimedia.org/r/670013 (https://phabricator.wikimedia.org/T276868) (owner: 10Razzi) [15:38:02] (03PS1) 10Phamhi: wikireplica: depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/670209 [15:38:13] (03CR) 10Jbond: [C: 03+2] pki::multirootca: add headers module [puppet] - 10https://gerrit.wikimedia.org/r/670208 (owner: 10Jbond) [15:39:51] (03PS4) 10JMeybohm: admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 [15:40:57] (03PS1) 10Andrew Bogott: Followup to I01027c61fbeafa3ae688af6bee79a2e044785d5e [puppet] - 10https://gerrit.wikimedia.org/r/670210 [15:41:42] (03CR) 10Andrew Bogott: [C: 03+2] Followup to I01027c61fbeafa3ae688af6bee79a2e044785d5e [puppet] - 10https://gerrit.wikimedia.org/r/670210 (owner: 10Andrew Bogott) [15:42:40] (03PS1) 10Muehlenhoff: Temporarily switch to deb.debian.org for sodium reboot [puppet] - 10https://gerrit.wikimedia.org/r/670211 [15:42:43] (03PS1) 10Jbond: P:pki::multirootca: correct typo cert -> ca [puppet] - 10https://gerrit.wikimedia.org/r/670212 [15:43:21] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1072.eqiad.wmnet with reason: REIMAGE [15:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:33] (03PS2) 10Muehlenhoff: Temporarily switch to deb.debian.org for sodium reboot [puppet] - 10https://gerrit.wikimedia.org/r/670211 [15:44:45] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: correct typo cert -> ca [puppet] - 10https://gerrit.wikimedia.org/r/670212 (owner: 10Jbond) [15:44:55] (03PS2) 
10Jbond: P:pki::multirootca: correct typo cert -> ca [puppet] - 10https://gerrit.wikimedia.org/r/670212 [15:45:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1072.eqiad.wmnet with reason: REIMAGE [15:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:58] (03PS1) 10Klausman: manifest: Mov ML k8s machines in codfw to prod [puppet] - 10https://gerrit.wikimedia.org/r/670214 (https://phabricator.wikimedia.org/T272918) [15:49:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Aside from the aggregation comments inline, I don't see much point in splitting them by cluster, we can probably have them common all over" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 (owner: 10JMeybohm) [15:51:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14706 and previous config saved to /var/cache/conftool/dbconfig/20210309-155109-root.json [15:51:13] (03CR) 10Klausman: [C: 03+2] manifest: Mov ML k8s machines in codfw to prod [puppet] - 10https://gerrit.wikimedia.org/r/670214 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [15:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:44] (03PS1) 10Jcrespo: dbbackups: Move files and templates to dbbackups hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/670217 (https://phabricator.wikimedia.org/T138562) [15:53:20] (03PS2) 10Jcrespo: dbbackups: Move files and templates to dbbackups hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/670217 (https://phabricator.wikimedia.org/T138562) [15:54:22] (03CR) 10David Caro: [C: 03+2] wmcs.backups: Skip VMs in build status [puppet] - 10https://gerrit.wikimedia.org/r/670133 (https://phabricator.wikimedia.org/T276910) (owner: 10David Caro) [15:54:46] (03PS1) 10Jbond: P:pki::multirootca: use correct private key [puppet] - 
10https://gerrit.wikimedia.org/r/670219 [15:56:57] !log imported prometheus-ircd-exporter 0.2 to apt.wikimedia.org T224579 [15:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:04] T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 [15:57:21] (03CR) 10Jbond: [C: 03+2] P:pki::multirootca: use correct private key [puppet] - 10https://gerrit.wikimedia.org/r/670219 (owner: 10Jbond) [15:58:16] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/28471/" [puppet] - 10https://gerrit.wikimedia.org/r/670217 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:58:22] (03CR) 10Herron: [C: 03+1] profile: add scap log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/659426 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:58:45] (03PS1) 10Giuseppe Lavagetto: [WiP] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [15:58:54] 10SRE, 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10Papaul) p:05Triage→03Medium [15:59:08] 10SRE, 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10Papaul) @fgiunchedi thank you [15:59:37] (03CR) 10Muehlenhoff: [C: 03+2] Add profile::base::linux510 for cloudgw and cloudnet [puppet] - 10https://gerrit.wikimedia.org/r/668087 (owner: 10Muehlenhoff) [16:00:50] (03PS1) 10Phamhi: wikireplica: depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/670221 [16:03:33] (03PS5) 10JMeybohm: admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 [16:04:26] (03CR) 10JMeybohm: "> Patch Set 4: Code-Review-1" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 (owner: 10JMeybohm) [16:05:07] (03PS6) 10JMeybohm: admin_ng: Add 
GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 [16:05:54] mholloway: following up on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/668719 - this could use a backport? [16:06:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 80%: 10', diff saved to https://phabricator.wikimedia.org/P14707 and previous config saved to /var/cache/conftool/dbconfig/20210309-160613-root.json [16:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10RobH) a:05RobH→03Cmjohnson I've not seen this, but unless it happens twice it doesn't count! So, steps to fix: [[ https://netbox.wikimedia.org/dcim/devices/2746/ |... [16:10:13] brennen: yes, please! [16:10:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10RobH) [16:10:51] (03PS1) 10Jbond: cloud - heira: listen on localhost [puppet] - 10https://gerrit.wikimedia.org/r/670223 [16:11:48] mholloway: cool - i haven't yet prepped or synced the wmf.34 train branch anywhere, so i'll backport that after the branch commit merges shortly. [16:13:53] brennen: sounds great, thanks you! 
[16:13:56] *thank [16:14:17] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable ipv6 on envoy services mw canaries [puppet] - 10https://gerrit.wikimedia.org/r/669878 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [16:14:28] sure thing [16:15:57] (03CR) 10Jbond: [C: 03+2] cloud - heira: listen on localhost [puppet] - 10https://gerrit.wikimedia.org/r/670223 (owner: 10Jbond) [16:17:29] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 (owner: 10JMeybohm) [16:18:02] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 (owner: 10JMeybohm) [16:18:44] (03Merged) 10jenkins-bot: admin_ng: Add GlobalNetworkPolicies to allow pod-to-pod traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/670098 (owner: 10JMeybohm) [16:20:13] (03PS1) 10Phuedx: ULS: Remove unused ULSEventLogging variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) [16:21:00] (03CR) 10Phuedx: [C: 04-2] "DNM until 15th March." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/670224 (https://phabricator.wikimedia.org/T275894) (owner: 10Phuedx) [16:21:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14708 and previous config saved to /var/cache/conftool/dbconfig/20210309-162116-root.json [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be1022.eqiad.wmnet, wdqs1010.eqiad.wmnet, maps1009.eqiad.wmnet, es1027.eqiad.wmnet, ms-be1019.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:27:19] 10SRE: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10MoritzMuehlenhoff) >>! In T224579#6891782, @MoritzMuehlenhoff wrote: >>>! In T224579#6887456, @Majavah wrote: >> Hi, is this going to happen? Beta cluster has also an IRC server running Jessie, and in an effort of gett... [16:27:39] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1027.eqiad.wmnet [16:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:04] !log 1.36.0-wmf.34 was branched at e175899921535f83e168145cbe942489475607db for T274938 [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:12] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938 [16:31:28] (03CR) 10Brennen Bearnes: [C: 03+2] Branch commit for wmf/1.36.0-wmf.34 [core] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670019 (https://phabricator.wikimedia.org/T274938) (owner: 10TrainBranchBot) [16:31:34] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. 
[16:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:33] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [16:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1027.eqiad.wmnet [16:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:35:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:37:24] (03PS1) 10Jbond: P:pki::intermediate: add additional TLS clients [puppet] - 10https://gerrit.wikimedia.org/r/670228 [16:39:55] (03CR) 10Jbond: [C: 03+2] P:pki::intermediate: add additional TLS clients [puppet] - 10https://gerrit.wikimedia.org/r/670228 (owner: 10Jbond) [16:40:05] !log reimage analytics1077 to buster [16:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:30] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:41:32] (03PS1) 10MSantos: Map tiles for 3rd parties: allow consultant to access maps [puppet] - 10https://gerrit.wikimedia.org/r/670229 (https://phabricator.wikimedia.org/T276317) [16:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:57] (03PS1) 10Filippo Giunchedi: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T272977) [16:43:59] (03PS1) 10Filippo Giunchedi: Run tests for alerts [alerts] - 
10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) [16:45:59] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:23] (03PS2) 10Filippo Giunchedi: Run tests for alerts [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) [16:49:08] (03PS3) 10Filippo Giunchedi: Run tests for alerts [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) [16:50:47] (03PS4) 10Filippo Giunchedi: Run tests for alerts [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) [16:54:27] (03PS1) 10Volans: netbox: refctor unit tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/670233 [16:54:29] (03PS1) 10Volans: netbox: fix object type returned for status [software/spicerack] - 10https://gerrit.wikimedia.org/r/670234 [16:54:32] (03PS1) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [16:55:04] jouncebot: now [16:55:04] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [16:55:07] jouncebot: next [16:55:07] In 0 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T1700) [16:55:30] (03PS1) 10Urbanecm: sqwiki: Fix deployment of Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670236 [16:56:05] (03CR) 10Urbanecm: [C: 03+2] sqwiki: Fix deployment of Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670236 (owner: 10Urbanecm) [16:59:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1077.eqiad.wmnet with reason: REIMAGE [16:59:02] (03Merged) 10jenkins-bot: sqwiki: Fix deployment of Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670236 (owner: 
10Urbanecm) [16:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] jbond42 and cdanis: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T1700). Please do the needful. [17:00:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3119d7a703a38b328fa634db64b2929d54829884: sqwiki: Fix deployment of Growth features (duration: 01m 00s) [17:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:50] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10Cmjohnson) a:05Cmjohnson→03RobH @robh try it now, the cable was unplugged. [17:01:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1077.eqiad.wmnet with reason: REIMAGE [17:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:09] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.34 [core] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670019 (https://phabricator.wikimedia.org/T274938) (owner: 10TrainBranchBot) [17:03:11] (03PS1) 10Jbond: pki:intermediate: use correct cert bundle [puppet] - 10https://gerrit.wikimedia.org/r/670237 [17:04:47] (03CR) 10Jbond: [C: 03+2] pki:intermediate: use correct cert bundle [puppet] - 10https://gerrit.wikimedia.org/r/670237 (owner: 10Jbond) [17:07:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10Cmjohnson) a:05Cmjohnson→03RobH the ports were reversed, kafka-logging1002 was in 42 and ms-backup1002 was in 41. I swapped them and all should be good now. 
[17:07:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10Cmjohnson) a:05Cmjohnson→03RobH kafka-logging1001 mgmt is working using the correct mgmt password moved the cables to the correct switch port for kl1002 [17:07:49] (03PS1) 10Brennen Bearnes: Session tick: Add data QA flag if the user is in group data-qa [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670115 (https://phabricator.wikimedia.org/T276515) [17:10:02] (03CR) 10Elukey: [C: 03+1] "LGTM! Let's just run pcc for aqs1010 and say aqs1004 to see if anything unwanted is added to the existing nodes :)" [puppet] - 10https://gerrit.wikimedia.org/r/670197 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:10:05] (03CR) 10Brennen Bearnes: [C: 03+2] Session tick: Add data QA flag if the user is in group data-qa [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670115 (https://phabricator.wikimedia.org/T276515) (owner: 10Brennen Bearnes) [17:14:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) [17:14:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) a:05Cmjohnson→03RobH Robh: didn't work, try updating f/w please. [17:15:18] (03CR) 10Huji: [C: 03+1] "LGTM. 
Please schedule for deployment via https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670180 (https://phabricator.wikimedia.org/T253802) (owner: 10Whym) [17:16:08] (03Merged) 10jenkins-bot: Session tick: Add data QA flag if the user is in group data-qa [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670115 (https://phabricator.wikimedia.org/T276515) (owner: 10Brennen Bearnes) [17:16:10] (03CR) 10jerkins-bot: [V: 04-1] netbox: refctor unit tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/670233 (owner: 10Volans) [17:18:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:21:37] (03PS2) 10Volans: netbox: refctor unit tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/670233 [17:21:39] (03PS2) 10Volans: netbox: fix object type returned for status [software/spicerack] - 10https://gerrit.wikimedia.org/r/670234 [17:21:41] (03PS2) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [17:21:43] (03PS1) 10Volans: doc: move ClusterShell URL to HTTPS [software/spicerack] - 10https://gerrit.wikimedia.org/r/670243 [17:24:00] 10SRE, 10Icinga: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10Phamhi) [17:25:18] (03PS2) 10Mstyles: add new updater job properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) [17:28:15] hey all, having issues in beta right now. 
there seems to have been a live migration of deployment-db05 and now we're seeing "[ 2886.337845] EXT4-fs error (device vda3): ext4_validate_block_bitmap" in the horizon console log [17:29:51] mysql is dropping us and ssh connections are intermittently dropped as well [17:30:12] seems a possible root cause of the post-merge pile up in CI [17:34:43] (03PS1) 10Jbond: P:profile::client: support tls-remote-ca [puppet] - 10https://gerrit.wikimedia.org/r/670248 [17:36:33] seems the filesystem on db05 is corrupt. dancy and i are currently looking into it and thinking promoting deployment-db06 to master might be the way to go. we could use guidance from an sre however [17:36:39] (03CR) 10Jbond: [C: 03+2] P:profile::client: support tls-remote-ca [puppet] - 10https://gerrit.wikimedia.org/r/670248 (owner: 10Jbond) [17:37:29] marxarelli: I was able to sssh to deployment-db05 just fine. [17:37:49] it's intermittent [17:37:55] aha [17:39:55] (03PS2) 10Hnowlan: aqs: make aqs1010 a separate AQS cluster [puppet] - 10https://gerrit.wikimedia.org/r/670197 (https://phabricator.wikimedia.org/T257572) [17:40:01] (copying my thought from -releng) [17:40:02] deployment-db05 is current database master and the hypervisor it was on was having some trouble today, a.rturo live migrated it [17:40:10] given T268628 we could just build a new replica and promote db06 into beta master [17:40:11] T268628: Upgrade deployment-prep-db hosts to buster/MariaDB 10.4 - https://phabricator.wikimedia.org/T268628 [17:40:34] yeah, sounds like a good thing to do Majavah [17:40:47] only problem is that beta is almost out of quota [17:41:01] it has some shutdown Jessie VMs that I migrated to Buster last week but haven't deleted yet [17:41:46] Can they be deleted? If i guess quota could be increased? 
[17:41:46] (03PS1) 10Jbond: cfssl::cert: use client config not root [puppet] - 10https://gerrit.wikimedia.org/r/670249 [17:41:51] Majavah: I can delete 'em if they're off [17:41:57] we might as well delete db05 since it's corrupt, right? [17:42:16] Urbanecm: start with deployment-memc[04-07] [17:42:18] we should first depool it [17:42:35] (03CR) 10Jbond: [C: 03+2] cfssl::cert: use client config not root [puppet] - 10https://gerrit.wikimedia.org/r/670249 (owner: 10Jbond) [17:42:48] I can't get hands on just yet but will in a moment [17:44:56] should we make a task for this to coordinate !logs etc? [17:45:16] Majavah: yup [17:45:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:45:28] {{doing}} [17:49:21] Majavah: ty! [17:49:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:00] !log rebooting db2073 for firmware upgrade [17:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:13] marxarelli: Urbanecm: T276968 [17:50:15] T276968: deployment-db05 disk issues - https://phabricator.wikimedia.org/T276968 [17:54:43] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) servers are ready for racking waiting on space in A3 and A4 [17:57:00] (03PS1) 10Jbond: pki::client: allow user to override remote ca [puppet] - 10https://gerrit.wikimedia.org/r/670250 [17:57:02] (03PS1) 10Jbond: hieradata - cloud: foobar-client fix ca path [puppet] - 10https://gerrit.wikimedia.org/r/670251 [17:58:55] (03CR) 10Jbond: [C: 03+2] pki::client: allow user to override remote ca [puppet] - 
10https://gerrit.wikimedia.org/r/670250 (owner: 10Jbond) [17:59:13] RECOVERY - Host db2073 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [18:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T1800) [18:00:07] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1085.eqiad.wmnet with reason: REIMAGE [18:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:11] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28474/console" [puppet] - 10https://gerrit.wikimedia.org/r/670197 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [18:02:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1085.eqiad.wmnet with reason: REIMAGE [18:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:39] !log deleting shut down memc* deployment-prep instances to free up quota for replacement db instances (T276968) [18:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:45] T276968: deployment-db05 disk issues - https://phabricator.wikimedia.org/T276968 [18:02:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:03:13] marxarelli: wrong sal? beta cluster should be on releng sal afaik [18:03:40] k. poor beta :) [18:03:44] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[18:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:09] (03PS1) 10Jbond: P:pki:client: correct tls param [puppet] - 10https://gerrit.wikimedia.org/r/670253 [18:04:15] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670254 [18:04:17] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670254 (owner: 10Brennen Bearnes) [18:05:29] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:41] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670254 (owner: 10Brennen Bearnes) [18:06:50] (03CR) 10Jbond: [C: 03+2] P:pki:client: correct tls param [puppet] - 10https://gerrit.wikimedia.org/r/670253 (owner: 10Jbond) [18:09:23] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:09:25] (03CR) 10Jbond: [C: 03+2] hieradata - cloud: foobar-client fix ca path [puppet] - 10https://gerrit.wikimedia.org/r/670251 (owner: 10Jbond) [18:09:27] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:09:41] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to 
html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:10:17] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:10:41] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [18:10:47] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:10:49] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:49] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:12:06] (03PS1) 10Jbond: p::pki::client: enable tls_remote_ca [puppet] - 10https://gerrit.wikimedia.org/r/670255 [18:12:11] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:12:12] !log brennen@deploy1002 Started scap: testwikis wikis to 1.36.0-wmf.34 [18:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:19] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:12:19] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:12:25] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:12:37] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:13:36] (03CR) 10Jbond: [C: 03+2] p::pki::client: enable tls_remote_ca [puppet] - 10https://gerrit.wikimedia.org/r/670255 (owner: 10Jbond) [18:13:47] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [18:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:37] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) is CRITICAL: Test Get i18n strings for the Page Content Service returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 500 (expecting: 20 [18:14:37] page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobil [18:14:37] } (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/summary/{title} (Get 
summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 500 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview [18:14:37] est page returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:16:17] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [18:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:39] anybody checking the restbase failures? [18:16:51] https://logstash.wikimedia.org/app/dashboards#/view/restbase?_g=h@03cd538&_a=h@bda63c3 shows a lot of errors [18:17:05] hnowlan: --^ if you have an idea who should handle these [18:17:37] a lot of "HTTPError: Cannot read property 'en' of undefined" etc.. [18:17:46] restbase returning 500s [18:17:53] Majavah: fyi it's fine to ask WMCS for an increased quota on a temporary basis to migrate stuff [18:18:04] elukey: maybe related to mobileapps deployment I just did? checking [18:18:29] hi thesocialdev ! the errors started at around 18:05 UTC [18:18:50] seems matching more or less [18:19:23] also I just noticed that there is a mixture of issues [18:19:36] 429 (throttling or similar) and then 500s [18:19:45] the latter may be related to your deployment [18:19:52] yeah, mobileapps restarts tend to stress restbase apart from any coding errors in the service [18:20:24] at least in my experience, anyway. has that been true lately when you've deployed, thesocialdev ? [18:21:08] the 429 yes, but the 500s are not stopping [18:21:08] an example of 429 should be [18:21:10] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-restbase-2021.03.09?id=_bMsGHgBGiM4niWINxqE [18:21:19] that seems from VE ? 
[18:21:27] 10SRE, 10Analytics: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (10Ottomata) [18:21:30] anyway, the 500s take the priority [18:21:36] 10SRE, 10Analytics: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (10Ottomata) Note: Recent versions of Kafka have a new robust version of MirrorMaker, [[ https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0 | MirrorMaker 2 ]]. O... [18:21:39] thesocialdev: maybe let's rollback if it is the case? [18:21:40] the 429s are most likely a red herring [18:22:03] hnowlan: yep yep but they keep happening, same signature, I have seen them during the past days [18:22:12] so worth to investigate later on [18:22:20] there's a fix in restbase for them at the moment, just needs to be deployed [18:22:26] ahhhhh! [18:22:30] TIL, nice [18:22:39] or a fix for *one* source of them, heh [18:22:44] 10SRE, 10Analytics, 10observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (10Ottomata) [18:23:45] which one is that? 
out of curiosity [18:23:55] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) is CRITICAL: Test Get i18n strings for the Page Content Service returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 500 (expecting: 20 [18:23:55] page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobil [18:23:55] } (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 500 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview [18:23:55] est page returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:23:58] the 500s are to the mobileapps endpoint [18:24:08] the service on port 6012 [18:24:41] ^ less important than everything else we're looking at, but that alert text is very silly [18:24:46] thesocialdev: I really think that we should rollback [18:25:10] Maybe related to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/667138 ? 
[18:25:20] Rolling back [18:25:37] 10SRE, 10ops-codfw, 10DBA: Upgrade firmware on db2073 - https://phabricator.wikimedia.org/T276909 (10Papaul) 05Open→03Resolved Complete Before ` BIOS 2.4.3 IDRC 2.40 ` After BIOS 2.12 IDRAC 2,75 [18:25:40] 10SRE, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Papaul) [18:26:06] !log reimage an-worker1087 to buster [18:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:13] (03PS1) 10Ahmon Dancy: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/670258 [18:27:27] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/670258 (owner: 10Ahmon Dancy) [18:28:09] (03PS1) 10MSantos: Revert "mobileapps: Enable egress network policy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670121 [18:28:28] (03PS2) 10MSantos: Revert "mobileapps: Enable egress network policy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670121 [18:29:05] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:19] (03PS1) 10MSantos: Revert "mobileapps: bump to 2021-03-08-191346-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670122 [18:29:37] (03CR) 10MSantos: [C: 03+2] Revert "mobileapps: Enable egress network policy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670121 (owner: 10MSantos) [18:30:14] (03CR) 10MSantos: [C: 03+2] Revert "mobileapps: bump to 2021-03-08-191346-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670122 (owner: 10MSantos) [18:30:25] (03Merged) 10jenkins-bot: Revert "mobileapps: Enable egress network policy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670121 (owner: 10MSantos) [18:30:27] (03PS2) 10Ahmon Dancy: Merge branch 'master' into 
train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/670258 [18:30:36] (03CR) 10Hnowlan: [C: 03+1] Revert "mobileapps: bump to 2021-03-08-191346-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670122 (owner: 10MSantos) [18:30:45] db07 is now alive with its puppet cert signed [18:31:19] (03PS2) 10MSantos: Revert "mobileapps: bump to 2021-03-08-191346-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670122 [18:32:36] (03CR) 10MSantos: [C: 03+2] Revert "mobileapps: bump to 2021-03-08-191346-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670122 (owner: 10MSantos) [18:34:00] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:42] (03Merged) 10jenkins-bot: Revert "mobileapps: bump to 2021-03-08-191346-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/670122 (owner: 10MSantos) [18:35:08] 10SRE, 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10Papaul) [18:35:39] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:48] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[18:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:33] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:40:25] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [18:41:04] goood :) [18:41:10] thanks a lot thesocialdev [18:41:18] restbase is happy too [18:41:35] (going afk) [18:41:38] cool [18:42:41] if I get the go I'll deploy a version of restbase to resolve the 429 issues tomorrow [18:43:12] thesocialdev: i think this is the faulty change: https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/651178 [18:43:38] mholloway: good timing, at the same time I created the revert [18:44:50] if you look at the mobileapps dashboard in kibana, the errors are all coming from code in lib/wikiLanguage.js that assumes `languageMappingCache` is defined [18:45:28] (& is an array, etc.) [18:46:37] that said, i wasn't actually able to reproduce any of the logged errors locally [18:47:15] !log re-pool wdqs1004 [18:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:27] anyway, nice work getting things calmed down! [18:48:08] mholloway: thanks for pointing that! 
I have a theory, will investigate to see if it makes sense [18:49:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1087.eqiad.wmnet with reason: REIMAGE [18:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1087.eqiad.wmnet with reason: REIMAGE [18:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:10] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.36.0-wmf.34 (duration: 47m 25s) [18:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:58] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [18:59:17] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T1900) [19:03:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10RobH) Ok, Updating the firmware in this order: idrac, network card (only pcie device installed), bios. 
[19:04:39] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:41] !log brennen@deploy1002 Pruned MediaWiki: 1.36.0-wmf.31 (duration: 03m 34s) [19:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:23:07] PROBLEM - Check systemd state on labsdb1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:51] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:24:09] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 18 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [19:24:29] 10SRE, 10Icinga: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10Dzahn) T256656 T165795 [19:25:07] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [19:30:15] 10SRE, 10Icinga: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10Dzahn) Icinga itself can't be blamed because the login in front of Icinga does not come from Icinga. It's us who slapped that in front of it at one point in the past because of some security vulnerability and then we ne... [19:32:03] RECOVERY - Check systemd state on labsdb1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:43] (03PS12) 10Andrew Bogott: cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [19:33:45] (03PS4) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [19:33:48] (03PS4) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [19:34:08] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:35:16] (03PS13) 10Andrew Bogott: cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [19:35:18] (03PS5) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [19:35:20] (03PS5) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [19:35:23] (03CR) 10jerkins-bot: [V: 04-1] Add 
role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:45:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:47:36] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [19:58:07] (03PS1) 10Majavah: betacluster: replace db05 with db07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670273 (https://phabricator.wikimedia.org/T276968) [19:59:45] (03CR) 10Dzahn: "Ah, gotcha. thanks Brooke. Manuel if you could take a look" [puppet] - 10https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: 10Dzahn) [20:00:04] brennen and liw: How many deployers does it take to do Mediawiki train - American+European Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210309T2000). [20:01:49] rolling to group0 momentarily. 
[20:02:30] (03PS2) 10Majavah: betacluster: replace db05 with db07, add db08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670273 (https://phabricator.wikimedia.org/T276968) [20:04:04] (03PS1) 10Brennen Bearnes: group0 wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670274 [20:04:06] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670274 (owner: 10Brennen Bearnes) [20:05:01] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670274 (owner: 10Brennen Bearnes) [20:05:33] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:06:32] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.34 [20:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:22] !log train status: 1.36.0-wmf.34 (T274938) on group0 at 20:06:32 UTC; logs initially quiet. [20:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:29] T274938: 1.36.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T274938 [20:13:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10RobH) [20:15:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10RobH) a:05RobH→03Cmjohnson Updated firmwares, but error persists. Bios now 2.10.0 idrac now 4.40.00 nic now 21.60.22.11 & 21.60.16 I've updated the checklist on t... 
[20:17:11] PROBLEM - Check systemd state on labsdb1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:34] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/P14711" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/666979 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:21:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:54] (03PS1) 10Bstorm: wikireplicas: depool labsdb1009 due to instability [puppet] - 10https://gerrit.wikimedia.org/r/670276 (https://phabricator.wikimedia.org/T276980) [20:22:36] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10Sergey.Trofimovsky.SF) >>! In T274461#6895448, @jbond wrote: > I just wanted to note that the SSO project provi... 
[20:23:22] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE [20:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:42] (03PS1) 10Urbanecm: beta: Switch beta to read only on mediawiki level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670277 [20:24:48] (03CR) 10CDanis: Map tiles for 3rd parties: allow consultant to access maps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670229 (https://phabricator.wikimedia.org/T276317) (owner: 10MSantos) [20:25:03] !log downtimed labsdb1009 so it doesn't keep paging T276980 [20:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:10] T276980: mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 [20:25:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE [20:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:01] (03PS2) 10Urbanecm: beta: Switch beta to read only on mediawiki level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670277 (https://phabricator.wikimedia.org/T276968) [20:26:34] (03CR) 10Majavah: [C: 03+1] beta: Switch beta to read only on mediawiki level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670277 (https://phabricator.wikimedia.org/T276968) (owner: 10Urbanecm) [20:26:42] (03CR) 10Urbanecm: [C: 03+2] "UBN" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670277 (https://phabricator.wikimedia.org/T276968) (owner: 10Urbanecm) [20:26:58] (03PS2) 10Bstorm: wikireplicas: depool labsdb1009 due to instability [puppet] - 10https://gerrit.wikimedia.org/r/670276 (https://phabricator.wikimedia.org/T276980) [20:27:43] (03Merged) 10jenkins-bot: beta: Switch beta to read only on mediawiki level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670277 (https://phabricator.wikimedia.org/T276968) 
(owner: 10Urbanecm) [20:29:31] (03PS14) 10Andrew Bogott: cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [20:29:33] (03PS6) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [20:29:35] (03PS6) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [20:29:37] (03PS1) 10Andrew Bogott: Update prepare_cinder_volume.py to support mounting formatted volumes [puppet] - 10https://gerrit.wikimedia.org/r/670278 (https://phabricator.wikimedia.org/T269511) [20:30:20] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10Dzahn) >>! In T276170#6896563, @wkandek wrote: > gerrit.wikimedia.org lives on a second IP address on gerrit1001. Should we follow that model here as well? It would be appropriate in the "exa... 
[20:31:57] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1009 due to instability [puppet] - 10https://gerrit.wikimedia.org/r/670276 (https://phabricator.wikimedia.org/T276980) (owner: 10Bstorm) [20:32:02] (03PS1) 10RobH: correcting partman for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/670280 (https://phabricator.wikimedia.org/T273778) [20:32:55] (03CR) 10RobH: [C: 03+2] correcting partman for kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/670280 (https://phabricator.wikimedia.org/T273778) (owner: 10RobH) [20:33:41] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:34:57] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:36:59] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) [20:40:07] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['kafka-logging1002.eqiad.wmnet', 'kafka-logging1003.eqiad.wmnet'] ` The... 
[20:40:49] RECOVERY - Check systemd state on labsdb1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:35] !log depooled labsdb1009 T276980 [20:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:43] T276980: mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980 [20:42:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:42:28] !log reimaged an-worker1091 to buster [20:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:45:56] (03PS2) 10Andrew Bogott: Update prepare_cinder_volume.py to support mounting formatted volumes [puppet] - 10https://gerrit.wikimedia.org/r/670278 (https://phabricator.wikimedia.org/T269511) [20:45:58] (03PS15) 10Andrew Bogott: cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [20:46:00] (03PS7) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [20:46:02] (03PS7) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [20:47:27] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 17 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [20:47:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: 
job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:49:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:29] (03CR) 10Andrew Bogott: "@bstorm, this is now worth another look. This class now supports mounting already-formatted volumes (e.g. if they were detached from anoth" [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [20:59:27] (03PS1) 10Dzahn: builder/docker: break out docker ferm rules into own profile [puppet] - 10https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) [21:00:40] (03CR) 10jerkins-bot: [V: 04-1] builder/docker: break out docker ferm rules into own profile [puppet] - 10https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: 10Dzahn) [21:01:45] (03PS1) 10CDanis: bump eventgate-logging-external & attach geoip [deployment-charts] - 10https://gerrit.wikimedia.org/r/670288 (https://phabricator.wikimedia.org/T263496) [21:07:21] (03PS1) 10Dzahn: releases: include profile::docker::ferm in releases role [puppet] - 10https://gerrit.wikimedia.org/r/670289 (https://phabricator.wikimedia.org/T276869) [21:07:28] (03CR) 10Bstorm: "> Patch Set 15:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:11:42] (03PS1) 10RobH: kafka-logging partman updates [puppet] - 10https://gerrit.wikimedia.org/r/670290 (https://phabricator.wikimedia.org/T273778) [21:12:15] (03PS2) 10RobH: kafka-logging partman updates [puppet] - 10https://gerrit.wikimedia.org/r/670290 (https://phabricator.wikimedia.org/T273778) [21:12:42] (03CR) 10RobH: [C: 
03+2] kafka-logging partman updates [puppet] - 10https://gerrit.wikimedia.org/r/670290 (https://phabricator.wikimedia.org/T273778) (owner: 10RobH) [21:17:32] (03PS2) 10Dzahn: builder/docker: break out docker ferm rules into own profile [puppet] - 10https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) [21:22:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) I'm having partman issues since this needs to use a new recipe for hw raid 10 flat filesystem. flat.cfg didn't work, so went custom but still have... [21:25:20] (03CR) 10Bstorm: "> Patch Set 15:" [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:25:24] (03PS1) 10Razzi: Update cache config for superset 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/670293 (https://phabricator.wikimedia.org/T273850) [21:29:44] (03PS2) 10Razzi: Update cache config for superset 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/670293 (https://phabricator.wikimedia.org/T273850) [21:29:49] (03CR) 10Bstorm: "OH! Now the comments in the other patch make more sense. I was not reading my emails in the right order." 
[puppet] - 10https://gerrit.wikimedia.org/r/670278 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:31:31] (03CR) 10Bstorm: [C: 03+1] Update prepare_cinder_volume.py to support mounting formatted volumes [puppet] - 10https://gerrit.wikimedia.org/r/670278 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:31:38] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28475/console" [puppet] - 10https://gerrit.wikimedia.org/r/670293 (https://phabricator.wikimedia.org/T273850) (owner: 10Razzi) [21:31:58] (03CR) 10Razzi: [V: 03+1 C: 03+2] Update cache config for superset 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/670293 (https://phabricator.wikimedia.org/T273850) (owner: 10Razzi) [21:32:04] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools' beta feature for newtopictool on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669893 (https://phabricator.wikimedia.org/T275827) [21:32:06] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools' beta features on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669895 (https://phabricator.wikimedia.org/T276189) [21:32:08] (03PS1) 10Bartosz Dziewoński: Disable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670294 (https://phabricator.wikimedia.org/T276967) [21:39:39] (03CR) 10Bstorm: "I finally noticed Ibb4b8cd1bbe5772e906. That answered one of my questions in the comments and like 90% of my comments on this patch set." 
[puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:44:21] (03PS3) 10Dzahn: builder/docker: break out docker ferm rules into own profile [puppet] - 10https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) [21:50:37] (03CR) 10Dzahn: [V: 03+1] "noop on deneb https://puppet-compiler.wmflabs.org/compiler1003/28476/deneb.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/670286 (https://phabricator.wikimedia.org/T276869) (owner: 10Dzahn) [21:53:39] (03CR) 10Andrew Bogott: cloud-vps: Add a new resource to detect and format available cinder volumes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:55:03] (03PS6) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [21:55:05] (03CR) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [21:56:51] (03CR) 10jerkins-bot: [V: 04-1] netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [21:58:03] (03PS1) 10Razzi: Disable superset caching on staging [puppet] - 10https://gerrit.wikimedia.org/r/670309 (https://phabricator.wikimedia.org/T273850) [21:58:13] (03PS7) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [21:59:09] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28478/console" [puppet] - 10https://gerrit.wikimedia.org/r/670309 (https://phabricator.wikimedia.org/T273850) (owner: 
10Razzi) [21:59:21] (03CR) 10Razzi: [V: 03+1 C: 03+2] Disable superset caching on staging [puppet] - 10https://gerrit.wikimedia.org/r/670309 (https://phabricator.wikimedia.org/T273850) (owner: 10Razzi) [22:03:35] (03PS16) 10Andrew Bogott: cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) [22:03:37] (03PS8) 10Andrew Bogott: Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 [22:03:39] (03PS8) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [22:04:08] !log phab1001 - manually running phab public task dump script after making changes to redirect stdout [22:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:51] PROBLEM - Host cloudvirt1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:21:01] (03CR) 1020after4: [C: 03+1] profile: add scap log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/659426 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:24:54] RECOVERY - Host cloudvirt1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [22:36:20] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10jijiki) And today as well, so we can say that this happens every day. 
{F34148488} [22:39:07] (03PS2) 10CRusnov: Add enhanced RemoteUserBackend [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) [22:39:09] (03CR) 10CRusnov: Add enhanced RemoteUserBackend (033 comments) [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [22:44:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Jclark-ctr) a:03Cmjohnson [22:44:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Jclark-ctr) @Cmjohnson cloudgw1001 c8 u29 ports13/19 cableid #5322 cloudgw1002. d5 u21 ports 23/35 cableid #5320. [22:47:50] (03CR) 10Bstorm: [C: 03+1] "Cool stuff! Thanks for addressing my random concerns." [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [22:55:12] 10SRE, 10ops-codfw, 10User-fgiunchedi: Decom ms-be[2016-2027] - https://phabricator.wikimedia.org/T272837 (10Papaul) [22:59:40] (03PS3) 10CRusnov: Add enhanced RemoteUserBackend [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) [22:59:42] (03CR) 10CRusnov: Add enhanced RemoteUserBackend (033 comments) [software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [23:01:28] (03PS8) 10CRusnov: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) [23:02:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10Jclark-ctr) [23:08:18] (03CR) 10CRusnov: "> Patch Set 1:" (036 comments) 
[software/netbox] - 10https://gerrit.wikimedia.org/r/668574 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [23:08:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet'] ` The log ca... [23:09:25] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10Jclark-ctr) a:03Cmjohnson @Cmjohnson backup1003 Rack b2 U39 port33 Cableid #5031 [23:14:20] (03PS1) 10RobH: ms-backup100[12] updates [puppet] - 10https://gerrit.wikimedia.org/r/670318 (https://phabricator.wikimedia.org/T274206) [23:14:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10Jclark-ctr) Ms-backup1001 A4 u4 p3 id5322. Ms-backup1001. 
C2 U41 P41 ID5587 [23:15:47] (03CR) 10RobH: [C: 03+2] ms-backup100[12] updates [puppet] - 10https://gerrit.wikimedia.org/r/670318 (https://phabricator.wikimedia.org/T274206) (owner: 10RobH) [23:18:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) [23:18:39] (03PS2) 10Bstorm: wikireplica: depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/670190 (owner: 10Phamhi) [23:19:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:20:04] (03CR) 10Bstorm: [C: 03+1] wikireplica: depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/670190 (owner: 10Phamhi) [23:20:22] (03PS2) 10Bstorm: wikireplica: depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/670221 (owner: 10Phamhi) [23:20:58] (03CR) 10Bstorm: "If labsdb1009 get repooled, you'll need to rebase again." [puppet] - 10https://gerrit.wikimedia.org/r/670221 (owner: 10Phamhi) [23:21:05] (03CR) 10Bstorm: [C: 03+1] wikireplica: depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/670221 (owner: 10Phamhi) [23:21:17] (03PS2) 10Bstorm: wikireplica: depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/670209 (owner: 10Phamhi) [23:21:35] (03CR) 10Bstorm: [C: 03+1] wikireplica: depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/670209 (owner: 10Phamhi) [23:21:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:25:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH) [23:25:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet'] ` The log ca... [23:33:49] 10SRE, 10DNS, 10Traffic, 10serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (10wkandek) Let's go with the simpler solution and use the CNAME. [23:38:09] (03CR) 10Cwhite: [C: 03+1] Run tests for alerts [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [23:39:34] (03CR) 10Cwhite: [C: 03+2] profile: add scap log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/659426 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:41:08] (03PS1) 10RobH: fixing mac for ms-backup100[12] [puppet] - 10https://gerrit.wikimedia.org/r/670323 (https://phabricator.wikimedia.org/T274206) [23:41:32] (03CR) 10RobH: [C: 03+2] fixing mac for ms-backup100[12] [puppet] - 10https://gerrit.wikimedia.org/r/670323 (https://phabricator.wikimedia.org/T274206) (owner: 10RobH) [23:43:00] (03PS1) 10Krinkle: Fix layout shift class name parsing for SVGElement [extensions/NavigationTiming] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/670125 (https://phabricator.wikimedia.org/T276826) [23:44:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for 
hosts: ` ['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqia... [23:48:45] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install kafka-logging100[123] - https://phabricator.wikimedia.org/T273778 (10RobH) [23:54:16] (03PS7) 10Addshore: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 (owner: 10Matěj Suchánek) [23:54:41] (03CR) 10Andrew Bogott: [C: 03+2] Update prepare_cinder_volume.py to support mounting formatted volumes [puppet] - 10https://gerrit.wikimedia.org/r/670278 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [23:54:50] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: Add a new resource to detect and format available cinder volumes [puppet] - 10https://gerrit.wikimedia.org/r/668757 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [23:55:01] (03CR) 10Andrew Bogott: [C: 03+2] Move prepare_cinder_volume.py into the cinderutils module [puppet] - 10https://gerrit.wikimedia.org/r/669896 (owner: 10Andrew Bogott) [23:56:56] (03PS9) 10Andrew Bogott: Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) [23:58:20] (03CR) 10Andrew Bogott: [C: 03+2] Add role::labs::cindermount::srv [puppet] - 10https://gerrit.wikimedia.org/r/669958 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [23:58:24] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE [23:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:24] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE [23:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log