[00:41:39] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for awight - https://phabricator.wikimedia.org/T250364 (10awight) @fgiunchedi Thank you! [00:50:52] (03PS1) 10BryanDavis: toolforge: add toolforge.org and wmcloud.org to CSP allows [puppet] - 10https://gerrit.wikimedia.org/r/591236 (https://phabricator.wikimedia.org/T130748) [01:06:32] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:07:18] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:42:52] (03PS3) 10Jhedden: cloudvps: metricsinfra add prometheus alert manager and email notifications [puppet] - 10https://gerrit.wikimedia.org/r/591202 (https://phabricator.wikimedia.org/T250206) [02:45:03] (03CR) 10Jhedden: cloudvps: metricsinfra add prometheus alert manager and email notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/591202 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [04:32:50] PROBLEM - Host ganeti1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:38:44] RECOVERY - Host ganeti1006.mgmt is UP: PING WARNING - Packet loss = 75%, RTA = 0.79 ms [04:46:44] PROBLEM - Host ganeti1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:52:36] RECOVERY - Host ganeti1006.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 0.82 ms [04:54:14] (03CR) 10Vgutierrez: [C: 03+1] Add Host regex filtering [software/purged] - 10https://gerrit.wikimedia.org/r/591024 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [05:06:41] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable SSL_OP_PRIORITIZE_CHACHA on ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/591092 (owner: 10Vgutierrez) [05:09:17] !log rolling restart of ats-tls to enable SSL_OP_PRIORITIZE_CHACHA [05:09:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:50] PROBLEM - SSH on ganeti1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:15:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/591063 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [05:17:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/591062 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [05:18:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/591064 (owner: 10Andrew Bogott) [05:19:28] !log Deploy schema change on s2 codfw - T250055 [05:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:35] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 [05:22:24] (03CR) 10Muehlenhoff: "The patch looks good to me, but note that the Ferm version in jessie has broken AAAA handling, so if there are any instances accessing the" [puppet] - 10https://gerrit.wikimedia.org/r/591065 (owner: 10Andrew Bogott) [05:32:35] !log installing git security updates [05:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:11] !log Add db1095:3312, db1095:3320 to tendril - T250602 [05:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:17] T250602: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 [05:38:07] (03PS1) 10Marostegui: site.pp: Remove 10.4 comment [puppet] - 10https://gerrit.wikimedia.org/r/591259 [05:41:01] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove 10.4 comment [puppet] - 10https://gerrit.wikimedia.org/r/591259 (owner: 10Marostegui) [05:46:53] !log Deploy schema change on s6 codfw master [05:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:38] 
10Operations, 10Discovery-Search, 10Elasticsearch: Reindex commonswiki as shards have grown beyond critical threshold - https://phabricator.wikimedia.org/T231446 (10Aklapper) a:05Mathew.onipe→03None Removing assignee @mathew.onipe as the user does not seem to be active anymore. [05:53:41] 10Operations, 10SRE-tools, 10User-Joe, 10User-jijiki: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (10Aklapper) a:05Mathew.onipe→03None Removing assignee @mathew.onipe as the user does not seem to be active anymore. [05:53:44] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Aklapper) a:05Mathew.onipe→03None Removing assignee @mathew.onipe as the user does not seem to be active anymore. [05:53:46] 10Operations, 10Discovery-Search, 10Elasticsearch: Add more metrics to upstream's elasticsearch exporter. - https://phabricator.wikimedia.org/T214547 (10Aklapper) a:05Mathew.onipe→03None Removing assignee @mathew.onipe as the user does not seem to be active anymore. [05:53:50] 10Operations, 10Discovery-Search, 10Epic, 10Patch-For-Review: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10Aklapper) a:05Mathew.onipe→03None Removing assignee @mathew.onipe as the user does not seem to be active anymore. 
[06:04:30] PROBLEM - Host mw1307.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:10:32] RECOVERY - Host mw1307.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 0.81 ms [06:20:23] (03CR) 10Ayounsi: [C: 03+2] Revert "eqsin: temporarily prefer tunnel transport" [homer/public] - 10https://gerrit.wikimedia.org/r/589832 (owner: 10Ayounsi) [06:20:42] (03Merged) 10jenkins-bot: Revert "eqsin: temporarily prefer tunnel transport" [homer/public] - 10https://gerrit.wikimedia.org/r/589832 (owner: 10Ayounsi) [06:24:06] !log restore eqsin/ulsfo OSPF metric - T250653 [06:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:11] !log Rename flagged* tables on mediawikiwiki on db1075 - T248298 [06:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:17] T248298: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 [07:00:42] (03CR) 10Ema: [C: 03+2] Add Host regex filtering [software/purged] - 10https://gerrit.wikimedia.org/r/591024 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [07:06:26] !log cp4026: restart ats-tls and ats-be [07:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:40] (03PS1) 10Vgutierrez: Release 8.0.7-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591298 [07:26:22] !log cp4032: restart ats-tls and ats-be [07:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:51] (03PS2) 10Vgutierrez: Release 8.0.7-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591298 [07:40:14] (03PS1) 10Ema: Release 0.8 [software/purged] - 10https://gerrit.wikimedia.org/r/591300 (https://phabricator.wikimedia.org/T249583) [07:45:50] (03CR) 10Vgutierrez: [C: 03+1] purged: raise 'frontend_workers' [puppet] - 10https://gerrit.wikimedia.org/r/591108 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [07:46:24] (03CR) 10Ema: [C: 03+2] purged: raise 'frontend_workers' 
[puppet] - 10https://gerrit.wikimedia.org/r/591108 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [07:47:55] !log dropping old data and optimizing tables on pc1010 and pc2010 T247787 [07:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:02] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [07:50:05] (03CR) 10Ema: [C: 03+1] Release 8.0.7-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591298 (owner: 10Vgutierrez) [07:50:46] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.7-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591298 (owner: 10Vgutierrez) [07:54:10] !log cp3050: restart purged with 4 frontend workers [07:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:53] (03CR) 10Dzahn: [C: 03+1] DNS: Add mgmt and production DNS for cloudcontrol2004-dev [dns] - 10https://gerrit.wikimedia.org/r/591194 (owner: 10Papaul) [07:57:53] 10Operations, 10Research, 10Patch-For-Review: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10elukey) @bmansurov the idea seems good! About the propagation of the error, I would say that it is better to wrap the 503 returned by the API in something ad-hoc fo... 
[08:13:06] (03PS1) 10Dzahn: gerrit: add Keepalive=on to ProxyPass config lines [puppet] - 10https://gerrit.wikimedia.org/r/591304 (https://phabricator.wikimedia.org/T246763) [08:13:16] RECOVERY - SSH on ganeti1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:26:33] 10Operations, 10Core Platform Team, 10Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10ema) [08:30:38] (03PS1) 10Jcrespo: mariadb-backups: Include es4/5 into backup checks and refactor [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) [08:32:11] (03PS2) 10Jcrespo: mariadb-backups: Include es4/5 into backup checks and refactor [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) [08:35:23] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Include es4/5 into backup checks and refactor [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [08:38:53] (03PS2) 10Dzahn: ATS: switch backends for annual.wm.org and 15.wp.org [puppet] - 10https://gerrit.wikimedia.org/r/591052 (https://phabricator.wikimedia.org/T247650) [08:39:28] (03PS3) 10Jcrespo: mariadb-backups: Include es4/5 into backup checks and refactor [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) [08:40:32] (03PS3) 10Dzahn: ATS: switch backends for annual.wm.org and 15.wp.org [puppet] - 10https://gerrit.wikimedia.org/r/591052 (https://phabricator.wikimedia.org/T247650) [08:41:19] (03CR) 10Dzahn: [C: 03+2] "SANs have been added to the cert already" [puppet] - 10https://gerrit.wikimedia.org/r/591052 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [08:48:19] (03PS4) 10Jcrespo: mariadb-backups: Include es4/5 into backup checks and refactor [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) 
[08:48:35] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Enable notifications on db1095 after failover from db1140 [puppet] - 10https://gerrit.wikimedia.org/r/591047 (https://phabricator.wikimedia.org/T250602) (owner: 10Jcrespo) [08:51:51] (03PS1) 10Dzahn: ATS: switch backends for and research and bienvenida.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/591306 (https://phabricator.wikimedia.org/T247650) [08:51:58] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/22112/db1115.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [08:52:06] !log purged: rolling restart with 4 frontend workers [08:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:02] (03CR) 10Paladox: [C: 03+1] gerrit: add Keepalive=on to ProxyPass config lines [puppet] - 10https://gerrit.wikimedia.org/r/591304 (https://phabricator.wikimedia.org/T246763) (owner: 10Dzahn) [09:12:08] PROBLEM - SSH on mw1308.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:17:07] (03PS2) 10Dzahn: ATS: switch contint backend to use TLS [puppet] - 10https://gerrit.wikimedia.org/r/589565 (https://phabricator.wikimedia.org/T210411) [09:23:37] (03PS1) 10Vgutierrez: Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - 10https://gerrit.wikimedia.org/r/591308 [09:23:52] (03CR) 10jerkins-bot: [V: 04-1] Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - 10https://gerrit.wikimedia.org/r/591308 (owner: 10Vgutierrez) [09:24:52] (03PS1) 10Jcrespo: mariadb-backups: Move OK alert description to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/591309 (https://phabricator.wikimedia.org/T138562) [09:27:39] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Include es4/5 into backup checks and refactor [puppet] - 10https://gerrit.wikimedia.org/r/591305 (https://phabricator.wikimedia.org/T79922) 
(owner: 10Jcrespo) [09:33:13] (03PS9) 10Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) [09:33:38] (03CR) 10Jbond: "thanks for the continued review, updated" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [09:35:45] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Move OK alert description to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/591309 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:37:01] PROBLEM - dump of es4 in codfw on db1115 is CRITICAL: We could not find any completed dump for es4 at codfw https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:37:01] PROBLEM - dump of es5 in codfw on db1115 is CRITICAL: We could not find any completed dump for es5 at codfw https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:37:58] ^investigating, the check was just enabled, but not sure why it failed [09:38:04] (03PS1) 10Elukey: role::statistics::private: factor out geoip archive to a profile [puppet] - 10https://gerrit.wikimedia.org/r/591310 (https://phabricator.wikimedia.org/T249754) [09:38:06] (03CR) 10Jbond: [C: 03+2] cli: add pcc invocation to logging output (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588703 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [09:38:08] (03PS1) 10Elukey: role::statistics::explorer: move target hosts to hiera [puppet] - 10https://gerrit.wikimedia.org/r/591311 (https://phabricator.wikimedia.org/T243934) [09:38:12] (03CR) 10Jbond: [C: 03+2] pcc templates: refactor templates to make them more DRY [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588735 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [09:38:24] (03CR) 10Jbond: [C: 03+2] pcc templates: add cli instructions to template footer [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/588736 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [09:38:41] (03Merged) 10jenkins-bot: cli: add pcc invocation to logging output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588703 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [09:38:53] (03Merged) 10jenkins-bot: pcc templates: refactor templates to make them more DRY [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588735 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [09:38:57] (03Merged) 10jenkins-bot: pcc templates: add cli instructions to template footer [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588736 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [09:41:03] (03CR) 10jerkins-bot: [V: 04-1] role::statistics::private: factor out geoip archive to a profile [puppet] - 10https://gerrit.wikimedia.org/r/591310 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [09:41:12] (03CR) 10jerkins-bot: [V: 04-1] role::statistics::explorer: move target hosts to hiera [puppet] - 10https://gerrit.wikimedia.org/r/591311 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [09:43:10] (03PS1) 10Jbond: 0.7.2: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/591312 [09:43:58] (03CR) 10Jbond: [C: 03+2] 0.7.2: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/591312 (owner: 10Jbond) [09:46:43] (03PS1) 10Jbond: puppet_compiler: updated default version [puppet] - 10https://gerrit.wikimedia.org/r/591313 [09:48:44] (03CR) 10Jbond: [C: 03+2] puppet_compiler: updated default version [puppet] - 10https://gerrit.wikimedia.org/r/591313 (owner: 10Jbond) [09:49:45] (03PS2) 10Elukey: role::statistics::private: factor out geoip archive to a profile [puppet] - 10https://gerrit.wikimedia.org/r/591310 (https://phabricator.wikimedia.org/T249754) [09:49:47] (03PS2) 10Elukey: role::statistics::explorer: move target hosts to hiera 
[puppet] - 10https://gerrit.wikimedia.org/r/591311 (https://phabricator.wikimedia.org/T243934) [09:55:27] RECOVERY - dump of es4 in codfw on db1115 is OK: dump for es4 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2020-04-21 09:52:07 from es2022.codfw.wmnet (163 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:55:27] RECOVERY - dump of es5 in codfw on db1115 is OK: dump for es5 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2020-04-21 09:52:07 from es2025.codfw.wmnet (141 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:55:44] ^ marostegui \o/ [09:55:54] nice! [09:57:26] (03PS1) 10Elukey: Add fake kerberos keytabs for stat1006 [labs/private] - 10https://gerrit.wikimedia.org/r/591314 [09:57:41] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytabs for stat1006 [labs/private] - 10https://gerrit.wikimedia.org/r/591314 (owner: 10Elukey) [09:58:58] (03PS3) 10Elukey: role::statistics::explorer: move target hosts to hiera [puppet] - 10https://gerrit.wikimedia.org/r/591311 (https://phabricator.wikimedia.org/T243934) [10:01:04] (03PS1) 10Kormat: icinga: change from unix login to wikitech username [puppet] - 10https://gerrit.wikimedia.org/r/591315 [10:06:43] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:07:43] elukey: ^ maybe you? [10:08:08] marostegui: maybe those are for the labs private stuff, fixing [10:08:48] marostegui: ah no it is a commit from jbond42 [10:09:07] we were discussing elsewhere, probably I distracted him [10:09:46] oh sorry one sec [10:10:08] merging [10:10:31] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:12:55] RECOVERY - SSH on mw1308.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:14:48] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Move OK alert description to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/591309 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:15:04] (03PS2) 10Jcrespo: mariadb-backups: Move OK alert description to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/591309 (https://phabricator.wikimedia.org/T138562) [10:23:01] 10Operations, 10netops, 10observability: Investigate Juniper structured logs - https://phabricator.wikimedia.org/T250703 (10ayounsi) Looks like it's a mixed bag, not all logs are structured, eg: ` <78>1 2020-04-21T10:00:00.049Z re0.cr2-esams /usr/sbin/cron 59827 - - (root) CMD ( /usr/libexec/atrun) <35>1 2... [10:26:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: wmcs-package-build: don't use the $INSTANCEPROJECT env var [puppet] - 10https://gerrit.wikimedia.org/r/591318 (https://phabricator.wikimedia.org/T249837) [10:30:33] hrmm.. systemd state on deploy1001 and contint2001 is broken.. looking what's happening there [10:32:05] eh.. 
it's gone and existed in Icinga for 30 seconds [10:35:56] (03PS2) 10Arturo Borrero Gonzalez: toolforge: wmcs-package-build: don't use the $INSTANCEPROJECT env var [puppet] - 10https://gerrit.wikimedia.org/r/591318 (https://phabricator.wikimedia.org/T249837) [10:36:14] (03CR) 10jerkins-bot: [V: 04-1] toolforge: wmcs-package-build: don't use the $INSTANCEPROJECT env var [puppet] - 10https://gerrit.wikimedia.org/r/591318 (https://phabricator.wikimedia.org/T249837) (owner: 10Arturo Borrero Gonzalez) [10:49:05] <_joe_> !log mwdebug1001:~# iptables -A INPUT -s 10.64.32.208 -m statistic --mode random --probability 0.1 -j DROP (T240684) [10:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:13] T240684: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 [10:49:30] (03PS3) 10Arturo Borrero Gonzalez: toolforge: wmcs-package-build: don't use the $INSTANCEPROJECT env var [puppet] - 10https://gerrit.wikimedia.org/r/591318 (https://phabricator.wikimedia.org/T249837) [10:50:20] 10Operations, 10observability, 10cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (10Dzahn) [10:51:46] 10Operations, 10observability, 10cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (10Dzahn) [10:52:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: wmcs-package-build: don't use the $INSTANCEPROJECT env var [puppet] - 10https://gerrit.wikimedia.org/r/591318 (https://phabricator.wikimedia.org/T249837) (owner: 10Arturo Borrero Gonzalez) [10:53:16] _joe_: random, probability? is that chaos engineering? 
[10:53:36] <_joe_> mutante: I'm trying to see if I can trigger the tkos with random packet loss [10:53:49] aha, interesting [10:54:37] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Samwalton9) 05Resolved→03Open I don't seem to be able to log in, using my Wikitech login :( [11:00:04] Deploy window No Deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200421T1100) [11:03:18] (03CR) 10Dzahn: [C: 03+2] ATS: switch contint backend to use TLS [puppet] - 10https://gerrit.wikimedia.org/r/589565 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [11:05:22] (03PS1) 10Jbond: ferm_status: add diff function [puppet] - 10https://gerrit.wikimedia.org/r/591319 [11:06:12] !log https://integration.wikimedia.org now also using TLS between ATS and contint1001 using envoy (T210411) [11:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:19] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [11:07:22] (03PS1) 10Jcrespo: wikireplicas: Set innodb_purge_threads to 10 [puppet] - 10https://gerrit.wikimedia.org/r/591320 (https://phabricator.wikimedia.org/T247978) [11:13:01] (03CR) 10Marostegui: "I agree with the change, but keep in mind that since I killed all the heavy queries the host has reached the peak (:p) on lag and purges a" [puppet] - 10https://gerrit.wikimedia.org/r/591320 (https://phabricator.wikimedia.org/T247978) (owner: 10Jcrespo) [11:15:47] !log recreating cert for contint/integration to add integration.mediawiki.org in addition to integration.wikimedia.org [11:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:13] (03PS1) 10Dzahn: ssl: add integration.mediawiki.org to contint cert [puppet] - 10https://gerrit.wikimedia.org/r/591321 (https://phabricator.wikimedia.org/T210411) [11:21:15] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -in contint.wikimedia.org.crt 
-text -noout | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/591321 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [11:22:52] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [11:23:09] (03PS1) 10Jbond: phabricator: remove srcaddr in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/591322 [11:28:29] (03CR) 10Dzahn: [C: 03+1] "lgtm. easy to test that ssh to git-ssh.wikimedia.org still works after this. maybe this whole thing would be more readable if we'd use re" [puppet] - 10https://gerrit.wikimedia.org/r/591322 (owner: 10Jbond) [11:33:34] (03CR) 10Fdans: [C: 03+1] Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt) [11:46:33] (03PS2) 10Jbond: phabricator: remove srcaddr in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/591322 [11:53:08] (03CR) 10Elukey: [C: 03+2] role::statistics::private: factor out geoip archive to a profile [puppet] - 10https://gerrit.wikimedia.org/r/591310 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [11:53:17] https://integration.wikimedia.org/ is throwing 5xx but I don't have root on contint* boxes so I can't debug further than to say that jenkins and envoy are running [11:53:21] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer: move target hosts to hiera [puppet] - 10https://gerrit.wikimedia.org/r/591311 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [11:53:28] contint roots aren't around. anyone here can help debug? [11:53:30] (03PS4) 10Elukey: role::statistics::explorer: move target hosts to hiera [puppet] - 10https://gerrit.wikimedia.org/r/591311 (https://phabricator.wikimedia.org/T243934) [11:56:46] oh my [11:56:51] twentyafterfour: i am around [11:57:25] hashar: cool, I can't tell if it's jenkins, envoy or something else causing the problem [11:57:44] what's up ? 
[11:57:54] jenkins logs complaining about unable to complete log rotation which is strange but probably not related [11:58:00] https://integration.wikimedia.org/ci/ gives 500 [11:58:06] which is proxied to jenkins [11:58:17] then https://integration.wikimedia.org/ is timing out / 500 ing as well [11:58:20] via cp1083.eqiad.wmnet, ATS/8.0.7 Error: 502, internal error - server connection terminated at 2020-04-21 11:49:22 GMT [11:58:30] so it is something else between the caches and apache on contint1001 [11:58:32] sigh, that was me [11:58:51] I'm surprised there aren't alerts for that [11:59:10] oh [11:59:15] we use human monitoring I guess :] [11:59:44] but yeah seems like we need a http service check [12:00:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "While probably it would be more "correct" to set a keepalive timeout on jetty and set a similar one here, this is probably the simplest so" [puppet] - 10https://gerrit.wikimedia.org/r/591304 (https://phabricator.wikimedia.org/T246763) (owner: 10Dzahn) [12:00:49] give me a minute [12:01:06] PROBLEM - Host mw1308.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:03:22] (03PS1) 10Dzahn: Revert "ATS: switch contint backend to use TLS" [puppet] - 10https://gerrit.wikimedia.org/r/591325 [12:03:48] (03PS2) 10Dzahn: Revert "ATS: switch contint backend to use TLS" [puppet] - 10https://gerrit.wikimedia.org/r/591325 [12:05:21] (03PS3) 10Dzahn: Revert "ATS: switch contint backend to use TLS" [puppet] - 10https://gerrit.wikimedia.org/r/591325 (https://phabricator.wikimedia.org/T210411) [12:05:49] (03CR) 10Dzahn: [C: 03+2] Revert "ATS: switch contint backend to use TLS" [puppet] - 10https://gerrit.wikimedia.org/r/591325 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [12:06:10] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "ATS: switch contint backend to use TLS" [puppet] - 10https://gerrit.wikimedia.org/r/591325 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [12:07:08] RECOVERY - Host mw1308.mgmt is UP: 
PING WARNING - Packet loss = 77%, RTA = 0.79 ms [12:09:13] twentyafterfour: not sure why yet, just reverting. checked the cert and everything running puppet on the ATS hosts via cumin now [12:14:16] twentyafterfour: hashar: back https://integration.wikimedia.org/ci/ [12:14:28] gotta try it again after the long weekend [12:14:56] ;]]] [12:15:17] probably because of the reverse proxy [12:15:27] might have been yeah [12:16:11] (03CR) 10Elukey: [C: 03+2] Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt) [12:16:50] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [12:18:24] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) The change for contint (integration.wikimedia.org, integration.mediawiki.org) had to be reverted because https://integration.wikimedia.org/ci/ returned 502s. [12:19:46] (03PS3) 10Jbond: phabricator: remove srcaddr in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/591322 [12:21:41] twentyafterfour: sorry about that, i did not force the puppet run on all (72) ATS hosts earlier so i did not notice it right away.. depended which one you got. 
it wasn't a very long downtime though [12:24:20] (03PS4) 10Jbond: phabricator: remove srcaddr in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/591322 [12:25:08] mutante: np thanks for the fix [12:30:55] (03PS2) 10Jcrespo: wikireplicas: Set innodb_purge_threads to 10 [puppet] - 10https://gerrit.wikimedia.org/r/591320 (https://phabricator.wikimedia.org/T247978) [12:30:56] (03PS1) 10Jcrespo: mariadb-backups: Check backups size also based on previous runs [puppet] - 10https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562) [12:31:33] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Check backups size also based on previous runs [puppet] - 10https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:33:30] (03CR) 10Jcrespo: "sudo -u nagios ./check_mariadb_backups.py --section=s3 --datacenter=codfw --type=snapshot --warn-size-percentage=0" [puppet] - 10https://gerrit.wikimedia.org/r/591309 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:33:49] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/591309 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:33:52] (03CR) 10Marostegui: [C: 03+1] "You can either manually run puppet on icinga or just let it run whenever it comes next" [puppet] - 10https://gerrit.wikimedia.org/r/591315 (owner: 10Kormat) [12:34:05] (03CR) 10Jcrespo: "sudo -u nagios ./check_mariadb_backups.py --section=s3 --datacenter=codfw --type=snapshot --warn-size-percentage=0" [puppet] - 10https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:35:26] (03CR) 10Kormat: [C: 03+2] icinga: change from unix login to wikitech username [puppet] - 10https://gerrit.wikimedia.org/r/591315 (owner: 10Kormat) [12:44:51] (03CR) 10Dzahn: "Did you change the contact name in the private repo and the membership in contactgroups in the public repo? 
Please check with "icinga -v " [puppet] - 10https://gerrit.wikimedia.org/r/591315 (owner: 10Kormat) [12:45:47] kormat: you'll have to also change it in contacts in the private repo on the puppetmaster and contactgroups in the public puppet repo [12:45:59] (03CR) 10Volans: [C: 03+1] "Didn't tested but code looks good to me." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/591319 (owner: 10Jbond) [12:46:17] mutante: But isn't contacts deprecated? [12:46:24] mutante: hmm. icinga -v didn't complain, at least [12:46:37] (03CR) 10Muehlenhoff: "New hires start with Victorops only and no longer need a private Icinga contact." [puppet] - 10https://gerrit.wikimedia.org/r/591315 (owner: 10Kormat) [12:46:55] (03PS2) 10Jcrespo: mariadb-backups: Check backups size also based on previous runs [puppet] - 10https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562) [12:46:59] marostegui: what would replace it? [12:47:07] victorops [12:47:11] mutante: victorops [12:47:30] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Check backups size also based on previous runs [puppet] - 10https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:47:38] nobody has individual contacts anymore? that's new to me then, ok [12:47:58] existing SREs still have them, but not for new hires anymore [12:48:36] kormat: has it because his onboarding buddy was not informed :-P [12:48:54] onboarding "buddy", really. ;) [12:49:00] :-( [12:49:15] that's how it's officially defined :D [12:49:19] not my naming ;) [12:49:35] so, i can remove myself from contacts then? 
[12:49:52] moritzm: gotcha
[12:50:57] (CR) Jbond: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/591322 (owner: Jbond)
[12:52:51] (PS3) Jcrespo: mariadb-backups: Check backups size also based on previous runs [puppet] - https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562)
[12:54:45] kormat: yes AFAIK
[12:55:32] that's the sort of certainty i was looking for! :)
[12:56:23] it changed like last week, that's why, so we're still trying to figure it out the best config
[12:56:29] to make sure that the UI works fine
[12:56:36] and you can auth/downtime/etc...
[12:56:39] still not being there
[12:56:56] moritzm should be able to confirm as he has done it for Janis
[12:58:34] yeah, Janis doesn't have a private Icinga contact, we sorted that out when he was onboarded
[13:01:51] volans: i kid :) thanks for the info!
[13:03:12] :)
[13:04:16] PROBLEM - SSH on db1096.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:05:06] (PS3) Kormat: mariadb: Decommission dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/589185 (https://phabricator.wikimedia.org/T249590) (owner: Marostegui)
[13:08:48] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (CDanis) Sam, Which username did you use? Can you confirm that logging into Wikitech itself works? Thanks!
[13:10:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[13:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[13:11:23] kormat: go for the merge and deploy please \o/
[13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:10] (CR) Kormat: [C: +2] mariadb: Decommission dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/589185 (https://phabricator.wikimedia.org/T249590) (owner: Marostegui)
[13:12:42] (PS2) Jbond: ferm_status: add diff function [puppet] - https://gerrit.wikimedia.org/r/591319
[13:12:50] (CR) Jbond: "updated thanks" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/591319 (owner: Jbond)
[13:18:33] (PS1) Paladox: Grant push access to ldap/ops [software/cas-overlay-template] (refs/meta/config) - https://gerrit.wikimedia.org/r/591329
[13:19:05] (CR) Jbond: [V: +2 C: +2] Grant push access to ldap/ops [software/cas-overlay-template] (refs/meta/config) - https://gerrit.wikimedia.org/r/591329 (owner: Paladox)
[13:19:39] (CR) Kormat: [C: +2] wmnet: Remove production dns entry for dbproxy1011 [dns] - https://gerrit.wikimedia.org/r/589186 (https://phabricator.wikimedia.org/T249590) (owner: Marostegui)
[13:28:18] Operations, ops-eqiad, decommission, Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (Kormat) a:Marostegui→Jclark-ctr
[13:29:13] Operations, ops-eqiad, decommission, Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (Kormat) This is ready for the DC-Ops team.
[13:29:33] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (Samwalton9) I log into Wikitech using `Samwalton9`, which works as expected.
I've tried both Samwalton9 and samwalton on Superset. [13:30:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (10Marostegui) [13:31:59] (03CR) 10Hashar: "If one writes "docker-pkg build" surely they intend to build aren't they?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/589710 (owner: 10Hashar) [13:33:54] (03PS1) 10Jbond: idp: alow specifying specific branches for the idp profile [puppet] - 10https://gerrit.wikimedia.org/r/591330 (https://phabricator.wikimedia.org/T233930) [13:33:56] (03PS1) 10Jbond: idp_test: update idp_test to use staging branch [puppet] - 10https://gerrit.wikimedia.org/r/591331 (https://phabricator.wikimedia.org/T233930) [13:34:07] (03CR) 10Jbond: [C: 03+2] ferm_status: add diff function [puppet] - 10https://gerrit.wikimedia.org/r/591319 (owner: 10Jbond) [13:35:39] (03PS1) 10Vgutierrez: Release 8.0.7-rc1+really8.0.7final [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591333 [13:35:51] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.7-rc1+really8.0.7final [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591333 (owner: 10Vgutierrez) [13:36:50] 10Operations, 10Analytics, 10Traffic: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (10elukey) Me and Ema had several chats on IRC, reporting in here the summary: - the HTTP status `000` seems to be used for clients that have some trouble doing a HTTP request to ats-tls, wi... 
[13:37:28] 10Operations, 10Analytics: kafka-jumbo1003 /srv disk space usage over 90% - https://phabricator.wikimedia.org/T250347 (10elukey) 05Open→03Resolved [13:39:42] (03Abandoned) 10Vgutierrez: Release 8.0.7-rc1+really8.0.7final [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/591333 (owner: 10Vgutierrez) [13:43:02] !log upload trafficserver 8.0.7-1wm1 to apt.wm.o (buster) [13:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:07] 10Operations: git-pbuilder incorrectly copies DIST=stretch package files into results/buster-amd64 on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T250803 (10Ottomata) [13:54:26] (03PS1) 10Reedy: Update path to CirrusSearch maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/591335 [13:58:44] (03PS2) 10Gehel: mjolnir: Ensure python3.7 is available before initializing repo [puppet] - 10https://gerrit.wikimedia.org/r/578628 (https://phabricator.wikimedia.org/T247362) (owner: 10EBernhardson) [14:04:35] (03CR) 10Gehel: [C: 03+2] mjolnir: Ensure python3.7 is available before initializing repo [puppet] - 10https://gerrit.wikimedia.org/r/578628 (https://phabricator.wikimedia.org/T247362) (owner: 10EBernhardson) [14:05:02] RECOVERY - SSH on db1096.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:26] (03PS2) 10Reedy: Update path to CirrusSearch maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/591335 (https://phabricator.wikimedia.org/T250806) [14:07:29] (03CR) 10jerkins-bot: [V: 04-1] Update path to CirrusSearch maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/591335 (https://phabricator.wikimedia.org/T250806) (owner: 10Reedy) [14:08:39] !log contint1001: rm /var/log/apache2/doc_* # service has been moved to doc1001.eqiad.wmnet [14:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:15] hashar: :) puppet on contint2001 all 
done [14:12:50] (03CR) 10Hashar: "Just a note, we should probably just phase out "integration.mediawiki.org". We have moved everything under wikimedia.org ages ago :)" [puppet] - 10https://gerrit.wikimedia.org/r/591321 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [14:13:54] mutante: great :] I have added a couple more to have Docker data moved out of the / partition and mute some erroneous icinga alarms [14:14:18] mutante: https://phabricator.wikimedia.org/T224591#6071338 ;D [14:14:31] hashar: aww:) i merged the change and then realized "aww, crap , you did not add mediawiki.org to the cert" :) [14:14:39] fine with me to remove it of course [14:15:44] hashar: ah, i remember those fixes from contint1001, ACK! [14:16:27] (03CR) 10Dzahn: [C: 03+2] contint: ignore more Docker partitions disk checks [puppet] - 10https://gerrit.wikimedia.org/r/591037 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:16:31] !log installing OpenSSL updates on caches [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:41] hashar: /srv/docker does not exist yet on 2001 [14:17:50] !log rolling upgrade of ats to version 8.0.7-1wm1 [14:17:52] what will create it? 
[14:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:07] (PS1) Elukey: role::swap: add deprecation notice motd on notebook100[3,4] [puppet] - https://gerrit.wikimedia.org/r/591336 (https://phabricator.wikimedia.org/T249752)
[14:20:19] (PS2) Elukey: role::swap: add deprecation notice motd on notebook100[3,4] [puppet] - https://gerrit.wikimedia.org/r/591336 (https://phabricator.wikimedia.org/T249752)
[14:21:47] hashar: re: integration.mediawiki.org just noticed it is already pointing to dyna.wikimedia.org / apache cluster and rewritten to integration.wikimedia.org so we can delete that right away from the backend i guess
[14:22:17] mutante: and we can most probably drop it from dns as well
[14:22:45] mutante: the few requests that reaches contint1001 seems to be solely for bots looking for wordpress or some security vulnerability
[14:23:06] (PS1) Jbond: debdeploy: update zsh autocompletion [puppet] - https://gerrit.wikimedia.org/r/591337
[14:23:30] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/22122/" [puppet] - https://gerrit.wikimedia.org/r/591336 (https://phabricator.wikimedia.org/T249752) (owner: Elukey)
[14:23:35] (CR) Elukey: [C: +2] role::swap: add deprecation notice motd on notebook100[3,4] [puppet] - https://gerrit.wikimedia.org/r/591336 (https://phabricator.wikimedia.org/T249752) (owner: Elukey)
[14:24:08] (PS1) Dzahn: ci: remove integration.mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/591338
[14:25:02] (CR) Dzahn: "when merging this remember to also manually remove it for real..
not absenting it here" [puppet] - 10https://gerrit.wikimedia.org/r/591338 (owner: 10Dzahn) [14:26:15] (03PS1) 10Dzahn: delete integration.mediawiki.org [dns] - 10https://gerrit.wikimedia.org/r/591340 [14:26:21] (03CR) 10Jbond: [C: 03+2] debdeploy: update zsh autocompletion [puppet] - 10https://gerrit.wikimedia.org/r/591337 (owner: 10Jbond) [14:28:05] !log installing OpenSSL security updates [14:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:05] (03PS1) 10Jbond: debdeploy: update zsh autocomplete [puppet] - 10https://gerrit.wikimedia.org/r/591342 [14:31:31] (03CR) 10Hashar: contint: move Docker data out of / on contint2001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/591038 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:31:42] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:33:22] (03PS1) 10Dzahn: ci: create /srv/docker for docker data files on contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/591344 [14:34:05] (03CR) 10Jbond: [C: 03+2] debdeploy: update zsh autocomplete [puppet] - 10https://gerrit.wikimedia.org/r/591342 (owner: 10Jbond) [14:34:54] 10Operations, 10ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson cloudelastic1005. rack A4 U34 WMF5337 switchport 29 cloudelastic1006 rack b4 u23. 
WMF5338 switchport 41 [14:35:19] 10Operations, 10ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10Jclark-ctr) [14:35:58] (03CR) 10Muehlenhoff: [C: 03+1] contint: move Docker data out of / on contint2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/591038 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:36:10] moritzm: yeah it is a bit messy :/ [14:36:43] I don't even what defines that we have a /mnt/docker on contint1001 [14:36:55] !log disable puppet fleet wide to restart puppemaster [14:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:20] (03CR) 10Dzahn: contint: move Docker data out of / on contint2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/591038 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:37:23] (03CR) 10Dzahn: [C: 03+2] contint: move Docker data out of / on contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/591038 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:37:46] !log restarting apache on netbox1001 [14:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:14] mutante: puppetmaster is restarting ;D [14:38:42] should only take a few minutes, will log when its back [14:38:57] eh, as long as it's ok that i merged [14:39:00] stopped now [14:39:05] yes thats fine [14:39:16] good, thx [14:40:48] !log restarting apache on miscweb [14:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:58] PROBLEM - SSH on ganeti1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:45:06] !log puppet enabled again [14:45:10] mutante: ^^ [14:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:59] (03CR) 10Hashar: ci: create /srv/docker for docker data files on contint2001 (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/591344 (owner: 10Dzahn) [14:46:28] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for cloudcontrol2004-dev [dns] - 10https://gerrit.wikimedia.org/r/591194 (owner: 10Papaul) [14:47:00] mutante: I am running puppet on contint2001 [14:47:46] (03PS3) 10Papaul: DNS: Add mgmt and production DNS for cloudcontrol2004-dev [dns] - 10https://gerrit.wikimedia.org/r/591194 [14:47:47] !log volker-e@deploy1001 Started deploy [design/style-guide@d101234]: Deploy design/style-guide: [14:47:50] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for cloudcontrol2004-dev [dns] - 10https://gerrit.wikimedia.org/r/591194 (owner: 10Papaul) [14:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:56] !log volker-e@deploy1001 Finished deploy [design/style-guide@d101234]: Deploy design/style-guide: (duration: 00m 09s) [14:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:16] thanks jbond42, ack hashar [14:48:27] !log restarting docker on contint2001 [14:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:36] hehe [14:48:37] !log restart haproxy on dns-auth [14:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:45] docker created /srv/docker upon restarting [14:49:29] (03CR) 10Hashar: "The Docker daemon does create the directory upon starting, this change is thus not needed ;)" [puppet] - 10https://gerrit.wikimedia.org/r/591344 (owner: 10Dzahn) [14:50:56] (03PS1) 10Muehlenhoff: Update kubernetes-etcd Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/591354 [14:51:14] 10Operations, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Papaul) [14:51:26] !log contint2001: manually dropping /var/lib/docker (we now use /srv/docker ) [14:51:30] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:03] (03PS3) 10RLazarus: cergen: Add script for renewing mcrouter certs. [puppet] - 10https://gerrit.wikimedia.org/r/589076 [14:53:19] (03Abandoned) 10Dzahn: ci: create /srv/docker for docker data files on contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/591344 (owner: 10Dzahn) [14:53:42] mutante: all good :] and puppet is still ok [14:53:49] hashar: nice! [14:54:26] (03CR) 10Muehlenhoff: [C: 03+2] Update kubernetes-etcd Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/591354 (owner: 10Muehlenhoff) [14:58:30] PROBLEM - Host db1096.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:59:23] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [14:59:24] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:04] 10Operations, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Papaul) [15:04:32] RECOVERY - Host db1096.mgmt is UP: PING WARNING - Packet loss = 90%, RTA = 0.77 ms [15:07:17] (03PS1) 10Muehlenhoff: More Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/591357 [15:09:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10RobH) [15:10:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10RobH) [15:11:18] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10RobH) a:03jcrespo We did not have the racking info for this before it arrived, I've made the above task. 
Can you confirm racking and hostname details and then assign to... [15:13:02] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2002 + array - https://phabricator.wikimedia.org/T250817 (10RobH) [15:13:34] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2002 + array - https://phabricator.wikimedia.org/T250817 (10RobH) a:03jcrespo We did not have the racking info for this before it arrived, I've made the above task. Can you confirm racking and hostname details and then assign to @... [15:13:36] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10jcrespo) > Can you confirm racking and hostname details Cannot they be copied from the ones I gave for backup2002? T248934 [15:15:46] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2002 + array - https://phabricator.wikimedia.org/T250817 (10jcrespo) This is a duplicate task, this is already done at T248934 - you have quite a mixup there. [15:16:53] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` restbase2014.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202004211512_dzahn_66242_... [15:18:49] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2002 + array - https://phabricator.wikimedia.org/T250817 (10RobH) 05Open→03Invalid >>! In T250817#6075383, @jcrespo wrote: > This is a duplicate task, this is already done at T248934 - you have quite a mixup there. 
[15:19:33] (03PS1) 10Hashar: Revert "Update .ruby-version to what is running in production" [puppet] - 10https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) [15:19:35] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10RobH) >>! In T250816#6075371, @jcrespo wrote: >> Can you confirm racking and hostname details > > Cannot they be copied from the ones I gave for backup2002? T248934 Sure. [15:19:37] (03PS1) 10Hashar: Run rubocop on changes to .ruby-version [puppet] - 10https://gerrit.wikimedia.org/r/591359 (https://phabricator.wikimedia.org/T250538) [15:20:09] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10RobH) [15:22:44] (03PS1) 10Hashar: Update .ruby-version to what is running in production [puppet] - 10https://gerrit.wikimedia.org/r/591361 (https://phabricator.wikimedia.org/T250538) [15:22:46] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10jcrespo) [15:24:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10jcrespo) [15:24:28] (03CR) 10Hashar: "This change should be merged right now. The rake test break whenever a .rb file is changed!" [puppet] - 10https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:26:18] (03CR) 10CDanis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/591357 (owner: 10Muehlenhoff) [15:26:32] !log CI / Zuul does not get any events for some reason :/ [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:38] grlblbl [15:27:55] filed as https://phabricator.wikimedia.org/T250820 [15:28:24] there is no more zuul running?!! 
[15:28:34] ah no
[15:28:56] that is gerrit
[15:29:11] hashar: I don't know if related, but I noticed that on contint2001 there is no zuul command. e.g. for enqueueing pipeline testing
[15:29:27] Operations, LDAP-Access-Requests: Naïké Nembetwa Nzali(@Naike ) is a new Project Manager on CPT, and should be added to the "wmf" ldap group. - https://phabricator.wikimedia.org/T250821 (AMooney)
[15:30:05] Operations, Services, Service-deployment-requests, artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (Chtnnh)
[15:30:17] I am restarting Gerrit
[15:30:55] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase2014.codfw.wmnet'] ` Of which those **FAILED**: ` ['restbase2014.codfw.wmnet'] `
[15:32:06] !log Restarting Gerrit T250820 T246973
[15:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:13] T250820: CI / Zuul does not process any events!
- https://phabricator.wikimedia.org/T250820 [15:32:13] T246973: CI / Zuul not processing changes - https://phabricator.wikimedia.org/T246973 [15:34:17] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/591361 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:34:38] (03CR) 10Jbond: [C: 03+1] More Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/591357 (owner: 10Muehlenhoff) [15:34:49] (03PS1) 10Muehlenhoff: Exclude /mnt/hfds on airflow for debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/591366 [15:34:57] (03CR) 10jerkins-bot: [V: 04-1] Update .ruby-version to what is running in production [puppet] - 10https://gerrit.wikimedia.org/r/591361 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:35:26] (03PS2) 10Muehlenhoff: Exclude /mnt/hfds on airflow for debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/591366 [15:35:30] 10Operations, 10LDAP-Access-Requests: Add Naïké Nembetwa Nzali to "wmf" ldap group - https://phabricator.wikimedia.org/T250821 (10Aklapper) [15:35:33] (03PS1) 10Papaul: Add cloudcontrol2004-dev to DHCP and partman with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/591367 (https://phabricator.wikimedia.org/T250708) [15:35:44] 10Operations, 10LDAP-Access-Requests: Add Naïké Nembetwa Nzali to "wmf" ldap group - https://phabricator.wikimedia.org/T250821 (10Aklapper) 05Open→03Stalled @Amooney: Please see https://phabricator.wikimedia.org/tag/ldap-access-requests/ for required information. After providing the information, please set... [15:37:24] 10Operations, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Papaul) [15:38:11] !log CI is back, patches would need to be rechecked by commenting "recheck" in Gerrit. 
[15:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:52] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:38:53] moritzm: ^ [15:40:07] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/591359 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:40:19] !log replacing mgmt switch on a6-eqiad T250652 [15:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:26] T250652: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 [15:40:40] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Cmjohnson) I pulled all power plugs, reseated psu's, DIMM and CPU. The server will not power on, the LEDs are flashing orange and red. [15:42:10] (03CR) 10Jbond: [C: 03+1] Exclude /mnt/hfds on airflow for debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/591366 (owner: 10Muehlenhoff) [15:42:23] (03PS2) 10Papaul: Add cloudcontrol2004-dev to DHCP with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/591367 (https://phabricator.wikimedia.org/T250708) [15:42:48] PROBLEM - Host mw1308.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:45:03] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/591357 (owner: 10Muehlenhoff) [15:45:42] RECOVERY - SSH on ganeti1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:46:04] (03CR) 10Papaul: [C: 03+2] Add cloudcontrol2004-dev to DHCP with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/591367 (https://phabricator.wikimedia.org/T250708) (owner: 10Papaul) [15:46:14] (03PS3) 10Papaul: Add cloudcontrol2004-dev to DHCP with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/591367 
(https://phabricator.wikimedia.org/T250708) [15:46:17] (03CR) 10Papaul: [V: 03+2 C: 03+2] Add cloudcontrol2004-dev to DHCP with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/591367 (https://phabricator.wikimedia.org/T250708) (owner: 10Papaul) [15:47:43] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/591359 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:48:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10Jclark-ctr) backup1002 rack C7 U8 asset. WMF4805 Port 27 backup1002-array rack C7 U6 asset WMF4806 [15:48:14] (03CR) 10Andrew Bogott: [C: 03+1] Revert "Update .ruby-version to what is running in production" [puppet] - 10https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [15:48:42] RECOVERY - Host mw1308.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 0.79 ms [15:49:11] 10Operations, 10LDAP-Access-Requests: Add Naïké Nembetwa Nzali to "wmf" ldap group - https://phabricator.wikimedia.org/T250821 (10AMooney) Hi @Aklapper, She already completed like https://wikitech.wikimedia.org/wiki/Special:CreateAccount, Does she just need to be added to the "wmf" ldap group? 
[15:49:33] hashar: CI worked fine now
[15:50:01] (CR) Muehlenhoff: [C: +2] More Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/591357 (owner: Muehlenhoff)
[15:53:08] RECOVERY - Check systemd state on restbase2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:27] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (Jclark-ctr) a:jcrespo→Cmjohnson
[15:55:32] PROBLEM - Host mw1312.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:10] RECOVERY - MD RAID on restbase2014 is OK: OK: Active: 9, Working: 9, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:59:28] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:40] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms
[16:00:00] Puppet, Horizon: Allow providing a commit message for hieradata changes - https://phabricator.wikimedia.org/T250623 (aborrero) Bonus point: allow managing hieradata using git. This has the challenge of auth, proper multitenancy, etc, but worth noting anyway.
[16:00:16] PROBLEM - Host krb1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:42] PROBLEM - Host mc1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:42] PROBLEM - Host mc1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:42] PROBLEM - Host mc1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:42] PROBLEM - Host mc1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:42] PROBLEM - Host mc1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:54] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [16:01:10] PROBLEM - Host wtp1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:01:28] RECOVERY - Host mw1312.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [16:01:44] PROBLEM - Host restbase1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:01:56] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [16:02:56] PROBLEM - Host an-master1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:03:34] PROBLEM - Host elastic1048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:03:34] PROBLEM - Host wtp1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:06:10] RECOVERY - Host krb1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [16:06:18] 10Operations, 10observability, 10cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? 
- https://phabricator.wikimedia.org/T250787 (10JHedden) p:05Triage→03Medium [16:06:36] RECOVERY - Host mc1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [16:06:36] RECOVERY - Host mc1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [16:06:36] RECOVERY - Host mc1023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [16:06:36] RECOVERY - Host mc1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [16:06:36] RECOVERY - Host mc1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 11.97 ms [16:07:04] RECOVERY - Host wtp1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [16:07:38] RECOVERY - Host restbase1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [16:08:50] RECOVERY - Host an-master1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [16:09:30] RECOVERY - Host elastic1048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:09:30] RECOVERY - Host wtp1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:09:38] 10Operations, 10observability, 10cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (10JHedden) The hosts in codfw are used for platform testing and staging. It's useful to have these in Icinga, but we don't need email notifications or on the alert... 
[16:10:20] (PS1) Papaul: Partman: Add cloudcontrol2004-dev [puppet] - https://gerrit.wikimedia.org/r/591370 (https://phabricator.wikimedia.org/T250708)
[16:10:44] Operations, ops-eqiad, DBA: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (Cmjohnson) Replaced the management switch, updated netbox
[16:10:47] Operations, ops-eqiad, DBA: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (Cmjohnson) Open→Resolved
[16:11:28] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (Cmjohnson) @Marostegui this could be down for a bit, between HPE troubleshooting and getting a tech on-site.
[16:13:16] (PS2) Cmjohnson: Adding mgmt dns for restbase1028-1030 [dns] - https://gerrit.wikimedia.org/r/589667 (https://phabricator.wikimedia.org/T241784)
[16:14:28] (PS2) Papaul: Partman: Add cloudcontrol2004-dev [puppet] - https://gerrit.wikimedia.org/r/591370 (https://phabricator.wikimedia.org/T250708)
[16:14:51] (CR) Cmjohnson: [C: +2] Adding mgmt dns for restbase1028-1030 [dns] - https://gerrit.wikimedia.org/r/589667 (https://phabricator.wikimedia.org/T241784) (owner: Cmjohnson)
[16:15:03] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (jcrespo) Small correction: backup1002-array1 Please note the 1 at the end, while it is unlikely that we will add a second one, it is not completely impossible 0:-D More th...
[16:15:26] (03PS1) 10Bstorm: toolforge-k8s: move the tool role into maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/591371 [16:15:50] (03PS3) 10Papaul: Partman: Add cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/591370 (https://phabricator.wikimedia.org/T250708) [16:16:00] (03CR) 10jerkins-bot: [V: 04-1] toolforge-k8s: move the tool role into maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/591371 (owner: 10Bstorm) [16:16:53] 10Operations, 10ops-eqiad, 10DBA: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10jcrespo) Thanks, Chris, for the prompt response! [16:18:39] (03CR) 10Papaul: [C: 03+2] Partman: Add cloudcontrol2004-dev [puppet] - 10https://gerrit.wikimedia.org/r/591370 (https://phabricator.wikimedia.org/T250708) (owner: 10Papaul) [16:20:45] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10jcrespo) I will be the first contact for this server (although manuel will be around if needed, ofc :-D) @Cmjohnson we are aware- that is why migrated the service away, as it couldn't w... [16:22:16] 10Operations, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudcontrol2004-de... 
[16:25:18] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:55] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident, 10cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10JHedden) p:05Triage→03Medium [16:43:30] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/591359 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [16:46:40] (03CR) 10JMeybohm: [C: 03+1] "Sorry for the noise :/" [puppet] - 10https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [16:47:19] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10CDanis) Ah, sorry, I hadn't realized you had multiple wikitech accounts. There's also a `Samwalton` (not `samwalton`) on wikitech. The uppercased one is the one I added to the `wmf... [16:47:36] (03CR) 10Hashar: "No worries, it was just a glitch in the test runner. It did not ran rubocop for your change and thus let it pass through." 
[puppet] - 10https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [16:48:28] RECOVERY - cassandra-a service on restbase2014 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:49:18] !log bootstrapping restbase2014-a — T250050 [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:25] T250050: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 [16:49:40] RECOVERY - cassandra-a SSL 10.192.16.85:7001 on restbase2014 is OK: SSL OK - Certificate restbase2014-a valid until 2020-11-29 09:26:08 +0000 (expires in 221 days) https://phabricator.wikimedia.org/T120662 [16:50:30] (03PS1) 10Cmjohnson: Adding mgmt dns for cloudelastic100[56] [dns] - 10https://gerrit.wikimedia.org/r/591377 (https://phabricator.wikimedia.org/T249062) [16:51:58] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [16:52:29] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for cloudelastic100[56] [dns] - 10https://gerrit.wikimedia.org/r/591377 (https://phabricator.wikimedia.org/T249062) (owner: 10Cmjohnson) [16:53:19] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) [16:53:36] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) updated bios/mgmt/idrac [16:53:48] 10Operations, 10ops-eqiad, 
10Patch-For-Review: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10Cmjohnson) [16:53:54] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10Cmjohnson) updated bios/mgmt/idrac [16:58:27] 10Operations, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcontrol2004-dev.wikimedia.org'] ` Of which those **FAILED**:... [16:59:09] 10Operations, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudcontrol2004-de... [17:08:25] (03PS2) 10Elukey: role::statistics::explorer: add profiles to match role::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/589553 (https://phabricator.wikimedia.org/T249754) [17:12:41] (03PS1) 10Aaron Schulz: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 [17:13:09] (03PS2) 10Aaron Schulz: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) [17:14:13] (03PS2) 10Bstorm: toolforge-k8s: move the tool role into maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/591371 [17:16:40] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:58] (03PS1) 10Cmjohnson: Adding mgmt dns for backup1002 [dns] - 10https://gerrit.wikimedia.org/r/591389 (https://phabricator.wikimedia.org/T250816) 
[17:19:42] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:56] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for backup1002 [dns] - 10https://gerrit.wikimedia.org/r/591389 (https://phabricator.wikimedia.org/T250816) (owner: 10Cmjohnson) [17:24:31] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcontrol2004-dev.wikimedia.org'] ` and were **ALL** successful. [17:26:24] (03PS1) 10CDanis: admin: add naike2020 to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/591391 (https://phabricator.wikimedia.org/T250821) [17:32:20] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Papaul) [17:35:04] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): (Need by: TBD) rack/setup/install cloudcontrol2004-dev - https://phabricator.wikimedia.org/T250708 (10Papaul) 05Open→03Resolved @Bstorm @Andrew server is ready for service. [17:35:53] (03CR) 10CDanis: [C: 03+2] admin: add naike2020 to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/591391 (https://phabricator.wikimedia.org/T250821) (owner: 10CDanis) [17:37:53] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer: add profiles to match role::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/589553 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [17:38:05] (03CR) 10Andrew Bogott: [C: 03+2] wmcs openstack haproxy: use openstack_controllers instead of nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/591133 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:38:37] andrewbogott: ok to merge? 
[17:38:48] elukey: yep, thanks [17:38:54] * andrewbogott was still typing passphrases [17:40:47] elukey: while you're here… have any thoughts about my mcrouter/ordered_json patches? (They're all blocked by CI/ruby things so can't be merged immediately in any case) [17:41:48] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Naïké Nembetwa Nzali to "wmf" ldap group - https://phabricator.wikimedia.org/T250821 (10Aklapper) >>! In T250821#6075586, @AMooney wrote: > Hi @Aklapper, She already completed like https://wikitech.wikimedia.org/wiki/Special:CreateAccount, Does sh... [17:41:58] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Naïké Nembetwa Nzali to "wmf" ldap group - https://phabricator.wikimedia.org/T250821 (10Aklapper) 05Stalled→03Open [17:42:38] andrewbogott: ah sorry didn't check them, are you blocked on those? [17:43:13] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Naïké Nembetwa Nzali to "wmf" ldap group - https://phabricator.wikimedia.org/T250821 (10CDanis) 05Open→03Resolved a:03CDanis Hi, @AMooney and @Naike! I'm happy to help, but wanted to clarify a few things first. Please re-open the ticket an... [17:43:25] elukey: Not especially blocked; I want to re-use that mcrouter config on another cluster and want to get things cleaned up first. [17:43:46] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppetmasters: use openstack_controllers instead of nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/591086 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:44:12] (03CR) 10Elukey: [C: 03+2] Exclude /mnt/hfds on airflow for debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/591366 (owner: 10Muehlenhoff) [17:46:41] andrewbogott: I am not sure if I agree with those patches, using jq seems really easy to check the config.. 
I wouldn't mess with mcrouter configs that may affect mediawiki [17:46:57] huh, ok [17:47:15] having an unreadable config file seems dumb [17:47:25] but it's true that it's easy enough to add the whitespace after the fact [17:48:25] (03CR) 10Andrew Bogott: [C: 03+2] mariadb wmcs ferm: use openstack_controllers instead of nova_controller etc. [puppet] - 10https://gerrit.wikimedia.org/r/591062 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:48:46] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Availability (MediaWiki-MultiDC), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Krinkle) [17:49:04] andrewbogott: I agree that a pretty version of it would be nice when checking configs manually, but ordered_json seems used a lot in puppet. I wouldn't be super comfortable in changing it [17:49:16] especially since using jq is super quick and easy [17:49:36] but these are my 2c, the SRE team might be willing to do it [17:49:41] it seems scary from my point of view :) [17:49:44] (for little gain) [17:49:49] yeah, I can see that [17:50:04] in theory it doesn't change the default behavior but there are probably more explicit ways to do that [17:50:06] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Samwalton9) Oh, yep, the fun that comes with the overlaps of your volunteer and staff accounts :) `samwalton9` is the ideal account to have set here, since that's the login I use fo... [17:50:20] (e.g. adding an entirely new function instead of adding an option to the existing one) [17:51:57] andrewbogott: I am wondering if a new function that simply pretty prints json would be a compromise [17:52:15] yep, that would make for a safer refactor [17:52:31] I'm sure that Ruby will do that automatically, it just involves importing some things. 
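(Editor's note on the exchange above: the compromise being discussed is to leave the existing compact, key-ordered serializer untouched and add a separate pretty-printing function next to it. The Puppet function in question is Ruby and its code is not shown in this log, so the following is only a Python sketch of the idea. The function names and the sample config are hypothetical and are not taken from the puppet repository.)

```python
import json


def compact_sorted_json(data):
    """Existing-style behaviour (assumed): sorted keys, no extra whitespace."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def pretty_sorted_json(data, indent=4):
    """New, separate helper: same key ordering, indented for human readers."""
    return json.dumps(data, sort_keys=True, indent=indent)


# Hypothetical config; keys and values are illustrative only.
config = {
    "route": "PoolRoute|example",
    "pools": {"example": {"servers": ["127.0.0.1:11211"]}},
}

compact = compact_sorted_json(config)
pretty = pretty_sorted_json(config)
```

Because the pretty variant is a new function rather than an option on the old one, existing callers keep producing byte-identical output, which is the "safer refactor" point made in the chat. Both strings parse back to the same data, so tools such as jq see no difference between them.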
[17:52:35] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Availability (MediaWiki-MultiDC), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10Krinkle) p:05Triage→03High [17:52:38] yep yep [17:52:52] I'll see if I can make that work [17:53:29] super [17:53:59] (03CR) 10Andrew Bogott: [C: 03+2] mariadb wmcs ferm: update ::wmcs profiles to use types and lookup() [puppet] - 10https://gerrit.wikimedia.org/r/591063 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:54:07] (03PS2) 10Andrew Bogott: mariadb wmcs ferm: update ::wmcs profiles to use types and lookup() [puppet] - 10https://gerrit.wikimedia.org/r/591063 (https://phabricator.wikimedia.org/T249941) [17:54:17] 10Operations, 10Performance-Team: MW Memcached get hit ratio trend over the past months - https://phabricator.wikimedia.org/T248890 (10Krinkle) p:05Triage→03Medium [17:54:32] 10Operations, 10Performance-Team: MW Memcached get hit ratio trend over the past months - https://phabricator.wikimedia.org/T248890 (10Krinkle) Does this warrant any further investigation or we can we consider this closed? 
[17:55:43] 10Operations, 10Performance-Team: MW Memcached get hit ratio trend over the past months - https://phabricator.wikimedia.org/T248890 (10elukey) 05Open→03Resolved I think that we can close, it was just to double check that everything was ok :) [17:55:52] btw elukey https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/589743/ is unrelated to the other patches, it's just a comment change summarizing what we learned when troubleshooting last week [17:56:15] (03PS2) 10Andrew Bogott: mariadb wmcs ferm: remove references to osm_host and cloudweb_dev_hosts [puppet] - 10https://gerrit.wikimedia.org/r/591064 [17:58:46] (03CR) 10Elukey: mcrouter: update example code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/589743 (owner: 10Andrew Bogott) [17:59:03] andrewbogott: I have one doubt about "clustername" --^ [17:59:11] ok [18:02:02] yep, you're right [18:02:21] (03PS2) 10Andrew Bogott: mcrouter: update example code [puppet] - 10https://gerrit.wikimedia.org/r/589743 [18:03:02] (03CR) 10Andrew Bogott: [C: 03+2] mariadb wmcs ferm: remove references to osm_host and cloudweb_dev_hosts [puppet] - 10https://gerrit.wikimedia.org/r/591064 (owner: 10Andrew Bogott) [18:03:51] (03CR) 10Andrew Bogott: "good to know! There's no rush for this so let's set it aside until Jessie is finished off." [puppet] - 10https://gerrit.wikimedia.org/r/591065 (owner: 10Andrew Bogott) [18:05:27] elukey, andrewbogott: I'm about to renew mcrouter certs, will that interfere with anything you're in the middle of? [18:05:29] (03PS1) 10Mholloway: MachineVision: Update image withholding term list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591406 [18:05:37] rzl: nope! 
[18:05:49] thanks :) [18:07:29] (03PS1) 10CDanis: admin: add mshaver to ldap_only_users; mnoor to-be-replaced [puppet] - 10https://gerrit.wikimedia.org/r/591407 (https://phabricator.wikimedia.org/T250430) [18:08:58] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10CDanis) a:03MNoorWMF Got it, makes sense. I've granted `wmf` access to your `mshaver` account. Please try it and confirm it works, then I'll remove `wmf` access fro... [18:09:44] (03CR) 10CDanis: [C: 03+2] admin: add mshaver to ldap_only_users; mnoor to-be-replaced [puppet] - 10https://gerrit.wikimedia.org/r/591407 (https://phabricator.wikimedia.org/T250430) (owner: 10CDanis) [18:12:29] (03CR) 10Andrew Bogott: [C: 04-1] "revisit in a couple of months" [puppet] - 10https://gerrit.wikimedia.org/r/591065 (owner: 10Andrew Bogott) [18:15:50] (03PS1) 10Andrew Bogott: glance_seed.sh.erb: update to use openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/591410 (https://phabricator.wikimedia.org/T249941) [18:16:17] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10CDanis) AIUI, we aren't generally in the business of giving `wmf` access for volunteer accounts, since that becomes too hard to track for offboarding procedures. [18:16:55] (03PS1) 10Andrew Bogott: OpenStack: remove nova_controller and nova_controller_standby from hiera [puppet] - 10https://gerrit.wikimedia.org/r/591411 (https://phabricator.wikimedia.org/T249941) [18:18:25] (03CR) 10Andrew Bogott: [C: 03+2] glance_seed.sh.erb: update to use openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/591410 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [18:19:34] PROBLEM - Check systemd state on labtestpuppetmaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:48] !log disabling puppet on all mcrouter hosts for cert renewal T248093 [18:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:55] T248093: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T248093 [18:21:34] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: remove nova_controller and nova_controller_standby from hiera [puppet] - 10https://gerrit.wikimedia.org/r/591411 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [18:22:40] (03PS1) 10Ayounsi: Cleanup unused policy-statements [homer/public] - 10https://gerrit.wikimedia.org/r/591414 [18:24:35] 10Operations, 10ops-eqiad, 10DBA: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10ayounsi) 05Resolved→03Open Thanks! Re-opening so we don't forget to update the cable in Netbox as well. [18:28:35] !log Updated the Wikidata property suggester with data from the 2020-04-06 JSON dump and applied the T132839 workarounds [18:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:43] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [18:29:16] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10RobH) [18:29:26] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudceph200[123]-dev - https://phabricator.wikimedia.org/T250846 (10RobH) [18:32:46] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10decommission, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Papaul) [18:33:00] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10decommission, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - 
https://phabricator.wikimedia.org/T217556 (10Papaul) 05Open→03Resolved complete [18:33:06] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch/Kafka storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10Papaul) [18:34:13] (03PS1) 10Esanders: VisualEditor: Allow external link paste on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591416 [18:35:41] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Papaul) [18:37:17] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Papaul) @Jclark-ctr this is left for you to finish and resolve when done [18:38:38] (03PS1) 10Papaul: DNS: Remove mgmt asset tag for dbproxy100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/591417 [18:40:38] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt asset tag for dbproxy100[1-2] [dns] - 10https://gerrit.wikimedia.org/r/591417 (owner: 10Papaul) [18:41:24] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:41:53] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Papaul) [18:42:24] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:43:46] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Papaul) @Jclark-ctr this is left for you to finish and resolve when done [18:44:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Papaul) [18:46:12] RECOVERY - cassandra-a CQL 10.192.16.85:9042 on restbase2014 is OK: TCP OK - 0.036 second response time on 10.192.16.85 port 9042 https://phabricator.wikimedia.org/T93886 [18:48:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Papaul) @Jclark-ctr this is left for you to finish and resolve when done [18:53:08] (03PS1) 10Andrew Bogott: Openstack observerenv profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591420 [18:53:10] (03PS1) 10Andrew Bogott: Openstack observerenv profiles: use $keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591421 [18:53:12] (03PS1) 10Andrew Bogott: Openstack metrics profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591422 [18:53:14] (03PS1) 10Andrew Bogott: Openstack metrics profiles: remove keystone_host params [puppet] - 10https://gerrit.wikimedia.org/r/591423 [18:56:06] (03CR) 10jerkins-bot: [V: 04-1] Openstack observerenv profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591420 (owner: 10Andrew Bogott) [18:56:21] (03CR) 10jerkins-bot: [V: 04-1] Openstack observerenv profiles: use $keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591421 (owner: 10Andrew Bogott) [18:58:51] (03PS2) 10Andrew Bogott: Openstack observerenv profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591420 [18:58:53] (03PS2) 10Andrew Bogott: Openstack 
observerenv profiles: use $keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591421 [18:58:55] (03PS2) 10Andrew Bogott: Openstack metrics profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591422 [18:58:57] (03PS2) 10Andrew Bogott: Openstack metrics profiles: remove keystone_host params [puppet] - 10https://gerrit.wikimedia.org/r/591423 [19:02:14] !log bootstrapping restbase2014-b — T250050 [19:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:20] T250050: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 [19:02:28] RECOVERY - cassandra-b service on restbase2014 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:03:26] RECOVERY - cassandra-b SSL 10.192.16.86:7001 on restbase2014 is OK: SSL OK - Certificate restbase2014-b valid until 2020-11-29 09:26:08 +0000 (expires in 221 days) https://phabricator.wikimedia.org/T120662 [19:09:04] !log mcrouter certs renewed on puppetmaster1001 (again); puppet re-enabled on mcrouter hosts and will update certs naturally over the next 30m T248093 [19:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:09] T248093: Renew certs for mcrouter on all application servers. 
- https://phabricator.wikimedia.org/T248093 [19:15:12] rzl: \o/ [19:15:16] \o/ [19:15:30] still keeping an eye on it for a while, but looks good [19:18:15] I happened to be logged in to mwmaint1002 for other reasons, and a manual run there just completed fine [19:18:27] 👍 [19:18:40] yeah I had tried a couple manually before enabling fleetwide [19:20:57] lol, seeing zero failures on https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-1h&to=now almost makes me skeptical of reporting ;) [19:26:29] mm true, I'll add a module with like 0.5% chance to fail randomly, just for comfort [19:28:00] rzl: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/failoid.pp [19:28:10] haha yep [19:49:21] 10Operations, 10serviceops: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T248093 (10RLazarus) 05Open→03Resolved Certs renewed! I still need to merge the script for next time, and maybe set it up to run periodically unattended, but I'm resolving this. [20:11:22] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:12:14] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [20:20:07] 10Operations, 10Core Platform Team, 10Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10Pchelolo) AFAIK cdnPurgeJob is only involved if the delayed purge is required if reboundDelay option is set. For every rebound purge job there's an instant purge multicast. Is the req... 
[20:26:04] 10Operations, 10Core Platform Team, 10Traffic: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 (10holger.knust) p:05Triage→03Medium [20:32:08] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: move the tool role into maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/591371 (owner: 10Bstorm) [20:34:35] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team): Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10holger.knust) a:03hnowlan [20:42:38] (03PS1) 10Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) [20:43:49] (03CR) 10jerkins-bot: [V: 04-1] noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: 10Urbanecm) [20:47:24] (03PS2) 10Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) [20:48:17] (03PS4) 10RLazarus: cergen: Add script for renewing mcrouter certs. [puppet] - 10https://gerrit.wikimedia.org/r/589076 [20:48:52] (03CR) 10jerkins-bot: [V: 04-1] noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: 10Urbanecm) [20:49:56] (03CR) 10RLazarus: "After testing in prod successfully, this is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/589076 (owner: 10RLazarus) [20:50:12] (03PS3) 10Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) [20:51:19] (03CR) 10jerkins-bot: [V: 04-1] noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: 10Urbanecm) [20:54:14] (03PS1) 10Andrew Bogott: designate service profiles: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591465 [20:54:16] (03PS1) 10Andrew Bogott: Horizon: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591466 [20:54:18] (03PS1) 10Andrew Bogott: designate pdns: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591467 [20:54:20] (03PS1) 10Andrew Bogott: cloud-vps cumin: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591468 [20:54:22] (03PS1) 10Andrew Bogott: keystone service: remove keystone_host hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/591469 [20:54:24] (03PS1) 10Andrew Bogott: Neutron common: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591470 [20:54:26] (03PS1) 10Andrew Bogott: shinken: rename keystone_host to keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591471 [20:54:28] (03PS1) 10Andrew Bogott: profile::wmcs::prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591472 [20:59:03] (03CR) 10Andrew Bogott: [C: 03+2] Openstack observerenv profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591420 (owner: 10Andrew Bogott) [20:59:15] (03CR) 10Andrew Bogott: [C: 03+2] Openstack metrics profiles: tidy up params [puppet] - 
10https://gerrit.wikimedia.org/r/591422 (owner: 10Andrew Bogott) [20:59:24] (03CR) 10Andrew Bogott: [C: 03+2] Openstack observerenv profiles: use $keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591421 (owner: 10Andrew Bogott) [21:00:30] (03PS3) 10Andrew Bogott: Openstack observerenv profiles: use $keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591421 [21:00:32] (03PS3) 10Andrew Bogott: Openstack metrics profiles: tidy up params [puppet] - 10https://gerrit.wikimedia.org/r/591422 [21:00:34] (03PS3) 10Andrew Bogott: Openstack metrics profiles: remove keystone_host params [puppet] - 10https://gerrit.wikimedia.org/r/591423 [21:00:36] (03PS2) 10Andrew Bogott: designate service profiles: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591465 [21:00:38] (03PS2) 10Andrew Bogott: Horizon: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591466 [21:00:40] (03PS2) 10Andrew Bogott: designate pdns: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591467 [21:00:42] (03PS2) 10Andrew Bogott: cloud-vps cumin: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591468 [21:00:48] (03PS2) 10Andrew Bogott: keystone service: remove keystone_host hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/591469 [21:00:50] (03PS2) 10Andrew Bogott: Neutron common: replace keystone_host with keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591470 [21:00:52] (03PS2) 10Andrew Bogott: shinken: rename keystone_host to keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591471 [21:00:54] (03PS2) 10Andrew Bogott: profile::wmcs::prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/591472 [21:01:21] (03PS4) 10Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - 
https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852)
[21:03:12] RECOVERY - cassandra-b CQL 10.192.16.86:9042 on restbase2014 is OK: TCP OK - 0.036 second response time on 10.192.16.86 port 9042 https://phabricator.wikimedia.org/T93886
[21:05:09] (CR) jerkins-bot: [V: -1] noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: Urbanecm)
[21:05:10] !log milimetric@deploy1001 Started deploy [analytics/refinery@35781db] (thin): Regular Analytics weekly train deploy THIN [analytics/refinery@35781db]
[21:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:19] !log milimetric@deploy1001 Finished deploy [analytics/refinery@35781db] (thin): Regular Analytics weekly train deploy THIN [analytics/refinery@35781db] (duration: 00m 08s)
[21:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:29] !log milimetric@deploy1001 Started deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db]
[21:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:58] (CR) Andrew Bogott: [C: +2] Openstack metrics profiles: remove keystone_host params [puppet] - https://gerrit.wikimedia.org/r/591423 (owner: Andrew Bogott)
[21:08:49] (CR) Andrew Bogott: [C: +2] designate pdns: replace keystone_host with keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591467 (owner: Andrew Bogott)
[21:08:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:08:54] (CR) Andrew Bogott: [C: +2] designate service profiles: replace
keystone_host with keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591465 (owner: Andrew Bogott)
[21:10:36] (CR) Andrew Bogott: [C: +2] Horizon: replace keystone_host with keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591466 (owner: Andrew Bogott)
[21:12:22] (PS5) Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852)
[21:12:39] (CR) jerkins-bot: [V: -1] profile::wmcs::prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591472 (owner: Andrew Bogott)
[21:14:58] (CR) jerkins-bot: [V: -1] noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: Urbanecm)
[21:15:01] (CR) Andrew Bogott: [C: +2] cloud-vps cumin: replace keystone_host with keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591468 (owner: Andrew Bogott)
[21:16:38] (PS6) Urbanecm: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852)
[21:17:48] (PS3) Andrew Bogott: keystone service: remove keystone_host hiera setting [puppet] - https://gerrit.wikimedia.org/r/591469
[21:17:53] (PS3) Andrew Bogott: Neutron common: replace keystone_host with keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591470
[21:17:55] (PS3) Andrew Bogott: shinken: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591471
[21:17:57] (PS3) Andrew Bogott: profile::wmcs::prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591472
[21:17:59] (PS1) Andrew Bogott: cloud-vps: define keystone_api_fqdn in
hiera for VMs [puppet] - https://gerrit.wikimedia.org/r/591480
[21:18:01] (CR) Andrew Bogott: [C: +2] keystone service: remove keystone_host hiera setting [puppet] - https://gerrit.wikimedia.org/r/591469 (owner: Andrew Bogott)
[21:21:05] (CR) Andrew Bogott: [C: +2] Neutron common: replace keystone_host with keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591470 (owner: Andrew Bogott)
[21:21:48] !log milimetric@deploy1001 Finished deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db] (duration: 16m 19s)
[21:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:15] (CR) Andrew Bogott: [C: +2] cloud-vps: define keystone_api_fqdn in hiera for VMs [puppet] - https://gerrit.wikimedia.org/r/591480 (owner: Andrew Bogott)
[21:24:08] (CR) jerkins-bot: [V: -1] profile::wmcs::prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591472 (owner: Andrew Bogott)
[21:24:14] (CR) Andrew Bogott: [C: +2] shinken: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591471 (owner: Andrew Bogott)
[21:27:15] (PS4) Andrew Bogott: prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591472
[21:31:36] (CR) Andrew Bogott: [C: +2] prometheus::metricsinfra: rename keystone_host to keystone_api_fqdn [puppet] - https://gerrit.wikimedia.org/r/591472 (owner: Andrew Bogott)
[21:35:41] (PS1) Andrew Bogott: shinken: leave internal var named keystone_host [puppet] - https://gerrit.wikimedia.org/r/591488
[21:37:07] (CR) Krinkle: noc.wikimedia.org: highlight.php should not append .txt to dblist URLs (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: Urbanecm)
[21:37:52] !log milimetric@deploy1001 Started deploy
[analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db] try 2 (analytics1030 failed with OSError the first time)
[21:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:05] !log milimetric@deploy1001 Finished deploy [analytics/refinery@35781db]: Regular Analytics weekly train deploy [analytics/refinery@35781db] try 2 (analytics1030 failed with OSError the first time) (duration: 00m 13s)
[21:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:56] (CR) Andrew Bogott: [C: +2] shinken: leave internal var named keystone_host [puppet] - https://gerrit.wikimedia.org/r/591488 (owner: Andrew Bogott)
[21:46:58] !log milimetric@deploy1001 Started deploy [analytics/refinery@64c5ec4]: Analytics: tiny follow-up on weekly train [analytics/refinery@64c5ec4]
[21:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:26] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[21:50:46] !log bootstrapping restbase2014-c — T250050
[21:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:53] T250050: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050
[21:56:44] !log rebooting cloudvirt1004, total raid controller failure
[21:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:52] (CR) Volans: "Some possible alternative approaches, questions/curiosities and small comments inline for being an SRE-operated tool."
(8 comments) [puppet] - https://gerrit.wikimedia.org/r/589076 (owner: RLazarus)
[22:24:03] !log milimetric@deploy1001 Finished deploy [analytics/refinery@64c5ec4]: Analytics: tiny follow-up on weekly train [analytics/refinery@64c5ec4] (duration: 37m 05s)
[22:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:30] (PS1) Andrew Bogott: Openstack hiera: remove keystone_host hiera setting [puppet] - https://gerrit.wikimedia.org/r/591525
[23:02:32] (PS1) Andrew Bogott: shinken: settle on keystone_api_fqdn after all [puppet] - https://gerrit.wikimedia.org/r/591526
[23:05:00] (CR) Andrew Bogott: [C: +2] shinken: settle on keystone_api_fqdn after all [puppet] - https://gerrit.wikimedia.org/r/591526 (owner: Andrew Bogott)
[23:05:19] (PS2) Andrew Bogott: shinken: settle on keystone_api_fqdn after all [puppet] - https://gerrit.wikimedia.org/r/591526
[23:05:21] (PS2) Andrew Bogott: Openstack hiera: remove keystone_host hiera setting [puppet] - https://gerrit.wikimedia.org/r/591525
[23:19:40] !log begin deploy of WDQS v 0.3.23 on deploy1001
[23:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:03] !log mstyles@deploy1001 Started deploy [wdqs/wdqs@4e0d55f]: v0.3.23
[23:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:38] !log mstyles@deploy1001 Finished deploy [wdqs/wdqs@4e0d55f]: v0.3.23 (duration: 11m 35s)
[23:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:25] !log deploy complete for wdqs v0.3.23
[23:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log