[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T0000).
[00:02:44] I'm running over the backport window but I imagine that won't interfere with the Phab one
[00:03:07] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632795|Enable logging of session cookie changes in group0 (T264793)]] (duration: 00m 58s)
[00:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:00] (CR) Gergő Tisza: [C: +2] Enable logging of session cookie changes in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/632796 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:05:42] (Merged) jenkins-bot: Enable logging of session cookie changes in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/632796 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:10:36] (Merged) jenkins-bot: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/632806 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:12:13] (Merged) jenkins-bot: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632807 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:15:08] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632796|Enable logging of session cookie changes in group1 (T264793)]] (duration: 00m 57s)
[00:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:14] T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793
[00:20:51] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632796|Enable logging of session cookie changes in group1 (T264793)]] (again, forgot to rebase the previous time) (duration: 00m 59s)
[00:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:57] T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793
[00:31:48] !log evening deploys done
[00:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:07] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35065024 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:49] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:57] feel free to revert the last config patch if it is causing too much log traffic.
[01:03:18] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Ladsgroup) I don't know if this has been considered or not and I admit I don...
[01:10:23] Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) >>! In T264881#6527486, @Ladsgroup wrote: > I don't know if this has...
[01:50:11] Operations, Release-Engineering-Team, Wikimedia Design Style Guide: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (Dzahn) I checked the git status on both backends, miscweb1002 and miscweb2002 and they are both at commit e3fda...
[01:51:41] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Wikimedia-GitHub: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (Dzahn)
[02:02:01] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Wikimedia-GitHub: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (Dzahn) The `deploy-style-guide.sh` script git pulls from https://gerrit.wikim...
[03:19:23] (CR) Hazard-SJ: [C: +1] Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[03:22:54] (CR) Hazard-SJ: [C: -1] "It seems that the desired approach has changed in T258354: instead of creating a new group, the current discussions steers towards removin" [mediawiki-config] - https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: Tobias Andersson)
[04:35:11] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:36:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 72.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:26:02] (PS1) Marostegui: mariadb: Remove puppet entries for es2015 [puppet] - https://gerrit.wikimedia.org/r/632835 (https://phabricator.wikimedia.org/T264700)
[05:26:54] (PS1) Marostegui: dns: Remove es2015 entries [dns] - https://gerrit.wikimedia.org/r/632836 (https://phabricator.wikimedia.org/T264700)
[05:27:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:31] (CR) Marostegui: [C: +2] mariadb: Remove puppet entries for es2015 [puppet] - https://gerrit.wikimedia.org/r/632835 (https://phabricator.wikimedia.org/T264700) (owner: Marostegui)
[05:33:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:07] (CR) Marostegui: [C: +2] dns: Remove es2015 entries [dns] - https://gerrit.wikimedia.org/r/632836 (https://phabricator.wikimedia.org/T264700) (owner: Marostegui)
[05:35:17] (PS1) Ladsgroup: mailman: Set default charset in mailman2 configs [puppet] - https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031)
[05:35:41] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700 (Marostegui)
[05:37:37] (CR) Ladsgroup: "I have some confidence that this one would work, if it's working, we can remove the apache hack." [puppet] - https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031) (owner: Ladsgroup)
[05:48:46] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[06:09:51] (PS2) Giuseppe Lavagetto: restbase: remove monitoring calls to the http endpoint [puppet] - https://gerrit.wikimedia.org/r/632708
[06:15:35] (PS1) KartikMistry: Update cxserver to 2020-10-08-053343-production [deployment-charts] - https://gerrit.wikimedia.org/r/632838 (https://phabricator.wikimedia.org/T264407)
[06:20:01] (CR) Giuseppe Lavagetto: [C: +2] restbase: remove monitoring calls to the http endpoint [puppet] - https://gerrit.wikimedia.org/r/632708 (owner: Giuseppe Lavagetto)
[06:40:29] (PS1) Elukey: Move the HDFS balancer to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/632877
[06:45:37] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25773/" [puppet] - https://gerrit.wikimedia.org/r/632877 (owner: Elukey)
[06:45:41] (CR) Elukey: [C: +2] Move the HDFS balancer to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/632877 (owner: Elukey)
[06:46:19] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[06:46:37] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I've added a dropdown to pick the percentile on https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host Here's what it looks...
[06:47:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P
[06:48:13] looking ^
[06:49:24] dcausse: lemme know if you need help
[06:49:35] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:49:59] <_joe_> I was about to ask :)
[06:50:32] <_joe_> why wdqs-ssl-codfw has notifications disabled?
[06:50:35] <_joe_> ffs.
[06:51:00] hmm they recovered themselves... graph shows a spike in load
[06:51:19] <_joe_> !log enable notifications for wdqs-ssl-codfw
[06:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:36] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[06:53:01] it's wdqs2002 that disappeared since yester 19h and we did not notice :/
[06:53:49] <_joe_> dcausse: disappeared?
[06:54:01] <_joe_> ok who stole the server?
[06:54:05] <_joe_> :P
[06:54:29] :)
[06:55:20] blazegraph deadlock I suppose, looking, it's no longer reporting any metrics
[06:57:08] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:57:20] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[06:57:45] !log restart blazegraph on wdqs2002 (stuck) T242453
[06:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:51] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453
[06:58:16] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:58:34] <_joe_> thanks dcausse
[07:00:22] !log depooling wdqs2002 (catching-up lag)
[07:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:11] (PS1) Elukey: Remove an-worker1043 from the Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411)
[07:06:49] (PS2) Elukey: Remove analytics1043 from the Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411)
[07:12:07] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25775/" [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[07:12:10] (CR) Elukey: [C: +2] Remove analytics1043 from the Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[07:23:01] !log installing pyzmq updates from Buster point release
[07:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:17] !log updated envoyproxy to 1.15.1-2 on all codfw hosts
[07:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:09] !log depooled wdqs2002 to catch up on lag
[07:40:12] ryankemper: ^
[07:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:18] Operations, Discovery-Search (Current work), Patch-For-Review: Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (Gehel) Increasing the number of shards for commons wiki is starting to be an issue. We need a better strategy.
[07:44:57] (PS7) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[07:45:26] !log Stop MySQL on db1077 to build it from s1 snapshot
[07:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:50] Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (akosiaris) Overall, I am willing to test this out, couples of points though: * Since it's recommended by various standards to do the default DROP thing, w...
[07:47:37] (PS8) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[07:50:53] Operations, serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (Joe)
[07:51:03] Operations, ops-eqiad, DC-Ops, Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` wdqs1009.eqiad.wmnet ` The log can be found in `/var/log/w...
[07:53:40] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:54:23] these servers are overloaded without wdqs2002 I think ^
[07:55:19] !log Rebuild db2125 from snapshots - T260670
[07:55:20] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:25] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:02:06] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:02:57] !log repooling wdqs2002
[08:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:16] ^better to have slightly stale data than crashing all the servers
[08:03:16] (PS9) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[08:03:46] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:04:56] !log gehel@cumin1001 START - Cookbook sre.hosts.downtime
[08:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:51] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:03] (PS10) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[08:10:22] RECOVERY - MD RAID on wdqs1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:12:32] (PS11) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[08:14:47] Operations, ops-eqiad, DC-Ops, Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs1009.eqiad.wmnet'] ` and were **ALL** successful.
[08:15:00] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:32] Operations, serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) I've created a standalone backport of icu63 in the component/icu63. Rebuilding PHP 7.2 with it is a little tricky, since PHP build-depends on libxml2 (for php7.2...
[08:19:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:47] !log running schema change against s8 in eqiad T259831
[08:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:53] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[08:23:13] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6524473, @Gilles wrote: > I really don't understand what I did wrong here It was never my intention to offend you, and g...
[08:28:50] (PS3) Urbanecm: [labs] Remove wmgMonologChannels override [mediawiki-config] - https://gerrit.wikimedia.org/r/631224
[08:33:04] Operations, observability, User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (fgiunchedi)
[08:36:43] (CR) Urbanecm: [C: -2] "Per decision made at T258354#6509213" [mediawiki-config] - https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: Tobias Andersson)
[08:38:33] !log roll-restart swift-object-replicator on ms-be2* - T261633
[08:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:39] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[08:40:31] (PS1) Filippo Giunchedi: swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633)
[08:40:44] (CR) jerkins-bot: [V: -1] swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[08:41:49] (PS2) Filippo Giunchedi: swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633)
[08:53:14] (PS2) Kormat: admin: Replace leila with leizi [puppet] - https://gerrit.wikimedia.org/r/632726 (https://phabricator.wikimedia.org/T264472)
[08:55:04] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (Kormat)
[09:02:37] (CR) Kormat: [C: +2] admin: Replace leila with leizi [puppet] - https://gerrit.wikimedia.org/r/632726 (https://phabricator.wikimedia.org/T264472) (owner: Kormat)
[09:03:49] Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I don't think that this discussion is appropriate in a public forum. An email thread seems like an ok starting point, and/or a meetin...
[09:08:38] (Abandoned) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:09:00] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[09:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:24] (PS2) Elukey: Add analytics data purge for webrequest sequence stats [puppet] - https://gerrit.wikimedia.org/r/632773 (https://phabricator.wikimedia.org/T262826) (owner: Joal)
[09:16:30] (CR) Elukey: [C: +2] Add analytics data purge for webrequest sequence stats [puppet] - https://gerrit.wikimedia.org/r/632773 (https://phabricator.wikimedia.org/T262826) (owner: Joal)
[09:27:32] (PS1) Elukey: role::druid::analytics::worker: enable TLS for conns to mysql [puppet] - https://gerrit.wikimedia.org/r/632896 (https://phabricator.wikimedia.org/T257412)
[09:27:41] (PS4) Jbond: firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888)
[09:27:43] (PS1) Jbond: sretest1001: Enable default reject rule for sretest1001 [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888)
[09:29:06] (CR) Elukey: [C: +2] role::druid::analytics::worker: enable TLS for conns to mysql [puppet] - https://gerrit.wikimedia.org/r/632896 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:29:28] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[09:34:41] Operations, DBA, Data-Persistence, Blocked-on-schema-change, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (LSobanski)
[09:37:26] (PS1) Elukey: Revert "role::druid::analytics::worker: enable TLS for conns to mysql" [puppet] - https://gerrit.wikimedia.org/r/632818
[09:40:27] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload
[09:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:38] (CR) Elukey: [C: +2] Revert "role::druid::analytics::worker: enable TLS for conns to mysql" [puppet] - https://gerrit.wikimedia.org/r/632818 (owner: Elukey)
[09:45:52] Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) >>! In T264888#6528076, @akosiaris wrote: > Overall, I am willing to test this out, couples of points though: > > * Since it's recommended by vario...
[09:45:56] (CR) Hashar: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/632687 (owner: Jbond)
[09:48:17] Operations, SRE-swift-storage, User-fgiunchedi: Some object-replicator log lines not making it to centrallog - https://phabricator.wikimedia.org/T264998 (fgiunchedi)
[10:00:04] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1000). Please do the needful.
[10:00:35] (PS3) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631783 (owner: PipelineBot)
[10:02:32] Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) > This would also mean that a malicious actor could use us to reflect RST packets however the 40b rst packet comes at a cost of a 60b syn This is n...
[10:03:33] (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631783 (owner: PipelineBot)
[10:05:42] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631783 (owner: PipelineBot)
[10:09:23] Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (Kormat) @leila: Your access should now be active. Please let me know if you run into any issues. I've opened a couple of subtasks to cover cleanup...
[10:14:52] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[10:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:08] (CR) Ayounsi: [C: +2] Pmacct add standard BGP community to flows [puppet] - https://gerrit.wikimedia.org/r/632603 (https://phabricator.wikimedia.org/T254332) (owner: Ayounsi)
[10:22:25] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:00] (CR) Hnowlan: [C: +2] conftool-data: add new restbase hosts [puppet] - https://gerrit.wikimedia.org/r/632497 (https://phabricator.wikimedia.org/T261512) (owner: Hnowlan)
[10:26:13] !log pooling restbase1028,restbase1029,restbase1030
[10:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1028.eqiad.wmnet
[10:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:30] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1028.eqiad.wmnet
[10:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:38] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1028.eqiad.wmnet
[10:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1028.eqiad.wmnet
[10:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1028.eqiad.wmnet
[10:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:07] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1028.eqiad.wmnet
[10:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:53] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) Done! And confirmed with kafkacat, eg: `"comms": "2914:420_2914:1008_2914:2000_2914:3000_14907:4"` As well as no dr...
[10:29:57] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:02] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:37] !log installing Postgres security updates on netboxdb2001
[10:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1029.eqiad.wmnet
[10:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:16] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (Kormat) Open→Resolved a:Kormat @CGlenn: your access is now in place.
[10:34:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1029.eqiad.wmnet
[10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1029.eqiad.wmnet
[10:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:27] !log installing Postgres security updates on netboxdb1001
[10:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:20] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:41:24] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:44:02] XioNoX: on netflow codfw/esams nfacctd is failing with plugin_buffer_size is too short
[10:44:19] uh
[10:45:07] er, actually everywhere
[10:45:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1030.eqiad.wmnet
[10:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:20] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1030.eqiad.wmnet
[10:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:26] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1030.eqiad.wmnet
[10:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:52:17] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime
[10:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:52] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:54:14] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:43] (CR) Volans: [C: -2] "After some more digging on the generated files this consolidation is a bit too much and we have both prefixes managed/unmanaged via Netbox" [software/netbox-extras] - https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: Volans)
[10:57:39] (PS1) Ayounsi: nfacctd: set plugin_buffer_size [puppet] - https://gerrit.wikimedia.org/r/632902
[10:58:45] (CR) Ayounsi: [C: +2] nfacctd: set plugin_buffer_size [puppet] - https://gerrit.wikimedia.org/r/632902 (owner: Ayounsi)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1100).
[11:00:05] No GERRIT patches in the queue for this window AFAICS.
[11:04:12] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:08] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:45] (03PS3) 10Hnowlan: api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 [11:51:15] (03CR) 10CDanis: [C: 03+1] nfacctd: set plugin_buffer_size [puppet] - 10https://gerrit.wikimedia.org/r/632902 (owner: 10Ayounsi) [11:52:42] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10mark) Hi all, I recommend we limit the conversations on this task to the technical aspects of this particular regression and its investigati... [11:56:33] (03CR) 10Hnowlan: [C: 03+2] api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 (owner: 10Hnowlan) [11:57:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:58:43] (03Merged) 10jenkins-bot: api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 (owner: 10Hnowlan) [11:59:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1200) [12:05:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [12:05:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:40] (03PS2) 10Alexandros Kosiaris: eventgate, eventstreams: Log with namedlevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) [12:07:48] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:07:48] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [12:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:09] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:10:09] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[12:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Awesome! The size of the events has increased by about 25-30%, which is considerable, but I believe sustainable for now. When we sanitize... [12:14:18] If no objection, would like to deploy cxserver now. [12:14:30] akosiaris: ^ Is it OK? [12:14:51] (03CR) 10Ayounsi: [C: 03+1] "Not tested but PCC and code LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [12:15:16] kart_: sure, go ahead [12:16:53] akosiaris: thanks. [12:17:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) Wow, that's more than expected indeed! If it's an issue down the road we could think of filtering out some communities (for example only...
[12:17:25] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-10-08-053343-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/632838 (https://phabricator.wikimedia.org/T264407) (owner: 10KartikMistry) [12:17:34] 10Operations, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10LSobanski) [12:19:03] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 2668 MB (3% inode=88%): /tmp 2668 MB (3% inode=88%): /var/tmp 2668 MB (3% inode=88%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [12:19:57] (03Merged) 10jenkins-bot: Update cxserver to 2020-10-08-053343-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/632838 (https://phabricator.wikimedia.org/T264407) (owner: 10KartikMistry) [12:21:18] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[12:21:18] (03PS1) 10Arturo Borrero Gonzalez: hieradata: labtestvirt2003: refresh network data for cloudgw PoC with latest allocations [puppet] - 10https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622) [12:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:35] (03CR) 10jerkins-bot: [V: 04-1] hieradata: labtestvirt2003: refresh network data for cloudgw PoC with latest allocations [puppet] - 10https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622) (owner: 10Arturo Borrero Gonzalez) [12:22:27] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Mvolz) [12:24:37] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [12:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:51] PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/12f2ce7ad6ba57a14a22800561b7118b99bf03272bc56ecd4d8d88fadc4d8410/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [12:26:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good and PCC seems sane." [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [12:26:28] (03CR) 10Ayounsi: [C: 03+1] sretest1001: Enable default reject rule for sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [12:26:50] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[12:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, alternatively we could enable this for role::sretest, then we can test with Stretch (1002) and Buster (1001)." [puppet] - 10https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [12:29:30] !log Updated cxserver to 2020-10-08-053343-production (T264407, T264859) [12:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:37] T264859: Create Inari Sámi Wikipedia - https://phabricator.wikimedia.org/T264859 [12:29:37] T264407: Check Apertium configuration for Serbo-croatian - https://phabricator.wikimedia.org/T264407 [12:35:13] (03PS2) 10Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) [12:37:50] (03PS1) 10Tchanders: Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) [12:38:37] PROBLEM - Druid overlord on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:38:49] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:10] (03PS3) 10Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) [12:41:19] PROBLEM - Druid coordinator on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:41:47] (03CR) 10Klausman: modules: Add functionality to allow use of 3.8 rocm packages (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [12:43:01] RECOVERY - Druid coordinator on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:43:10] druid was me sorry, I was testing a setting [12:43:39] RECOVERY - Druid overlord on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:43:51] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:54] elukey: was the setting "alertNow=true"? 
[12:44:59] kormat: nono it was 'lucaYouNeedToSpecifyTheTLSVersionOtherwiseIcannotMakeIt=crazy' [12:45:05] difficult one to find [12:45:29] hehe [12:45:29] RECOVERY - Disk space on sretest1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [12:47:50] 10Operations, 10LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10Kormat) [12:48:27] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) >>! In T239459#6504349, @Mvolz wrote: > The hold-up seems to be eventstreams; it act... [12:49:08] (03PS1) 10Elukey: Enable TLS between Druid clusters and Mariadb on an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/632909 (https://phabricator.wikimedia.org/T257412) [12:49:53] (03CR) 10Elukey: [C: 03+2] Enable TLS between Druid clusters and Mariadb on an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/632909 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [13:00:04] hashar and marxarelli: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1300). 
[13:00:21] (03CR) 10CDanis: [C: 03+1] swift: bump rsync_timeout [puppet] - 10https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [13:06:44] (03CR) 10CDanis: [C: 03+1] firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [13:06:54] (03CR) 10CDanis: [C: 03+1] sretest1001: Enable default reject rule for sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [13:13:48] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refresh network config for the PoC [puppet] - 10https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622) [13:15:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refresh network config for the PoC [puppet] - 10https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622) (owner: 10Arturo Borrero Gonzalez) [13:17:31] PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/7b8d97ddd0c9717a9d5e21ad5e0b4e6cf55d8cb1a9260de0263e77249fa060c1/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [13:20:58] (03CR) 10Elukey: [C: 03+1] modules: Add functionality to allow use of 3.8 rocm packages [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [13:22:48] (03CR) 10Elukey: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25782/an-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [13:23:45] (03CR) 10Klausman: [C: 03+2] modules: Add functionality to allow use of 3.8 rocm packages [puppet] - 
10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [13:23:47] (03PS4) 10Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) [13:25:32] (03CR) 10Klausman: [C: 03+2] modules: Add functionality to allow use of 3.8 rocm packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [13:28:02] (03PS1) 10Klausman: aptrepo: Include mivisionx package fro rocm again [puppet] - 10https://gerrit.wikimedia.org/r/632915 [13:29:03] (03PS2) 10Klausman: aptrepo: Include mivisionx package from rocm again [puppet] - 10https://gerrit.wikimedia.org/r/632915 [13:37:41] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:37:41] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) After discussing with the team, we think it's fine for now. If we want to add more fields or increase the sampling ratio, then we should i... [13:38:29] 10Operations, 10LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10Kormat) Hi @jbolorinos-ctr, What is your user name on wikitech? See https://phabricator.wikimedia.org/tag/ldap-access-requests/ Also, we need a WMF staff member as a contact person (... 
[13:41:29] (03CR) 10BBlack: firewall: change to default reject instead of drop (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [13:41:38] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (10Aklapper) ping - can #Operations please take a look? Thanks. [13:44:32] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10BBlack) FWIW, I am in general a fan of `REJECT` over `DROP`, especially when there's not even a great obscurity argument, as is the case here. It will be... [13:48:44] (03PS1) 10Filippo Giunchedi: pontoon: write the stack name once to the filesystem [puppet] - 10https://gerrit.wikimedia.org/r/632918 [13:48:46] (03PS1) 10Filippo Giunchedi: pontoon: read stack from stack.file [puppet] - 10https://gerrit.wikimedia.org/r/632919 [13:48:48] (03PS1) 10Filippo Giunchedi: pontoon: configure hiera based on the stack found on the filesystem [puppet] - 10https://gerrit.wikimedia.org/r/632920 [13:48:50] (03PS1) 10Filippo Giunchedi: pontoon: use hiera.output [puppet] - 10https://gerrit.wikimedia.org/r/632921 [13:49:01] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: bump rsync_timeout [puppet] - 10https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [13:53:31] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:37] (03CR) 10Elukey: [C: 03+1] aptrepo: Include mivisionx package from rocm again [puppet] - 10https://gerrit.wikimedia.org/r/632915 (owner: 10Klausman) [13:53:37] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (10CGlenn) Thank you @Kormat !! :) [14:04:21] (03PS3) 10Klausman: aptrepo: Include mivisionx package from rocm again [puppet] - 10https://gerrit.wikimedia.org/r/632915 [14:04:58] (03PS4) 10Klausman: aptrepo: Include mivisionx package from rocm again [puppet] - 10https://gerrit.wikimedia.org/r/632915 [14:05:31] (03CR) 10Klausman: [C: 03+2] aptrepo: Include mivisionx package from rocm again [puppet] - 10https://gerrit.wikimedia.org/r/632915 (owner: 10Klausman) [14:07:03] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:20] (03PS1) 10Jbond: diffscan: add defeat-rst-ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/632933 (https://phabricator.wikimedia.org/T264888) [14:17:46] !log importing icu 63.1-6+deb10u1~wmf5 to component/icu63 T264991 [14:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:52] T264991: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 [14:18:22] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (10Kormat) Hi @Xqt, it looks like the provider for the email address you're currently using for gerrit/phabricator had some issues. There's a bunch of errors in the mail log from 2020-10... [14:18:56] (03CR) 10Ayounsi: [C: 03+1] "Indeed, good idea!" 
[puppet] - 10https://gerrit.wikimedia.org/r/632933 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [14:19:27] (03PS1) 10Hnowlan: map::postgresql_common: make maps-admin chgrp toggle [puppet] - 10https://gerrit.wikimedia.org/r/632935 (https://phabricator.wikimedia.org/T263726) [14:21:41] !log Set global innodb_change_buffering = all; on pc2009 T263443 [14:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 [14:25:05] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (10Kormat) Ok, this is weird. During the same period that mx2001 was unable to deliver mail to you, mx1001 was able to deliver mail just fine. To the same email address. [14:28:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:29:19] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:29:33] (03PS5) 10Jbond: firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) [14:30:10] (03CR) 10jerkins-bot: [V: 04-1] firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [14:31:25] (03PS1) 10Jbond: idp-test1001: Enable default reject rule [puppet] - 10https://gerrit.wikimedia.org/r/632937 [14:31:58] (03PS1) 10Elukey: amd_rocm: replace << operator with list concatenation [puppet] - 10https://gerrit.wikimedia.org/r/632938 [14:32:02] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 
10https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [14:32:31] (03CR) 10Klausman: [C: 03+2] amd_rocm: replace << operator with list concatenation [puppet] - 10https://gerrit.wikimedia.org/r/632938 (owner: 10Elukey) [14:32:44] (03Abandoned) 10Jbond: sretest1001: Enable default reject rule for sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [14:32:57] (03Abandoned) 10Elukey: amd_rocm: replace << operator with list concatenation [puppet] - 10https://gerrit.wikimedia.org/r/632938 (owner: 10Elukey) [14:35:00] (03PS6) 10Jbond: firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) [14:35:43] (03CR) 10Jbond: [C: 03+2] diffscan: add defeat-rst-ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/632933 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [14:36:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/632937 (owner: 10Jbond) [14:46:30] (03PS1) 10Klausman: amd_rocm: Fix package list [puppet] - 10https://gerrit.wikimedia.org/r/632943 [14:46:49] (03CR) 10jerkins-bot: [V: 04-1] amd_rocm: Fix package list [puppet] - 10https://gerrit.wikimedia.org/r/632943 (owner: 10Klausman) [14:49:00] (03PS2) 10Klausman: amd_rocm: Fix package list [puppet] - 10https://gerrit.wikimedia.org/r/632943 [14:50:18] (03PS3) 10Klausman: amd_rocm: Fix package list [puppet] - 10https://gerrit.wikimedia.org/r/632943 [14:51:57] (03CR) 10Elukey: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25784/stat1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/632943 (owner: 10Klausman) [14:52:12] (03CR) 10Klausman: [C: 03+2] amd_rocm: Fix package list [puppet] - 10https://gerrit.wikimedia.org/r/632943 (owner: 10Klausman) [14:54:21] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: don't bring up interfaces 
recursively [puppet] - 10https://gerrit.wikimedia.org/r/632945 (https://phabricator.wikimedia.org/T261724) [14:55:25] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10bd808) [14:56:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:57:04] (03CR) 10MSantos: [C: 03+1] map::postgresql_common: make maps-admin chgrp toggle [puppet] - 10https://gerrit.wikimedia.org/r/632935 (https://phabricator.wikimedia.org/T263726) (owner: 10Hnowlan) [14:57:17] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:57:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: interfaces: don't bring up interfaces recursively [puppet] - 10https://gerrit.wikimedia.org/r/632945 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:02:40] (03CR) 10Jbond: firewall: change to default reject instead of drop (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [15:02:52] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [15:03:15] (03PS2) 10Jbond: idp-test1001: Enable default reject rule [puppet] - 10https://gerrit.wikimedia.org/r/632937 [15:06:23] 10Operations, 10Technical-blog-posts, 10Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (10ema) >>! In T264729#6526259, @srodlund wrote: > I made some minor grammar suggestions. Can you accept / reject them Done, thank you! I chang... 
[15:09:20] (03CR) 10BBlack: [C: 03+1] firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [15:11:14] (03CR) 10Muehlenhoff: [C: 03+1] firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [15:11:20] (03CR) 10Dzahn: [C: 03+1] "nothing to lose" [puppet] - 10https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [15:15:14] (03CR) 10Jbond: [C: 03+2] firewall: change to default reject instead of drop [puppet] - 10https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [15:15:18] (03CR) 10Jbond: [C: 03+2] idp-test1001: Enable default reject rule [puppet] - 10https://gerrit.wikimedia.org/r/632937 (owner: 10Jbond) [15:25:30] (03CR) 10MarcoAurelio: [C: 03+1] "LGTM; but not yet on https://wikitech.wikimedia.org/wiki/Deployments. Is this happening on Oct. 8?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders) [15:32:00] (03CR) 10Tchanders: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders) [15:36:29] PROBLEM - Check systemd state on idp-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:33] (03PS1) 10Jbond: base::firewall: use ferm::rule instead of ferm::conf [puppet] - 10https://gerrit.wikimedia.org/r/632948 (https://phabricator.wikimedia.org/T264888) [15:42:20] (03CR) 10Jbond: [C: 03+2] base::firewall: use ferm::rule instead of ferm::conf [puppet] - 10https://gerrit.wikimedia.org/r/632948 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [15:42:32] ^^ the idp-test issue is me [15:44:45] RECOVERY - Check systemd state on idp-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:04] jbond42 and cdanis: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1600). [16:01:59] (03CR) 10Volans: [C: 03+2] dns: add --keep-files option [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632745 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:02:31] (03PS3) 10Volans: sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) [16:02:36] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (10herron) With regard to why mail from eqiad seemed to be working while codfw was not -- part of this is because the working email examples are gerrit mails, which in addition to having... 
[16:03:55] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:04:04] (03PS2) 10Volans: sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) [16:04:14] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:05:05] (03CR) 10Dzahn: "the puppet part looks good to me" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [16:05:15] (03Merged) 10jenkins-bot: sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:05:30] (03Merged) 10jenkins-bot: sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:05:39] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) Thanks all quick update. I have deployed the firewall change to idp-test1001 and the scan time about 3x faster with the new rule (see below). howe... 
[16:08:11] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:20] !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [16:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:03] (03CR) 10Dzahn: profile: apply ipsec monitoring where enabled with ipsec_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite) [16:09:35] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:34] (03PS1) 10Volans: added prefix 91.198.174.224/27, adapt INCLUDE [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) [16:13:59] (03CR) 10jerkins-bot: [V: 04-1] added prefix 91.198.174.224/27, adapt INCLUDE [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:15:04] (03CR) 10Volans: "Expected CI failure as the file doesn't exist yet, I'm deploying it with the cookbook with --skip-authdns-update and then recheck this one" [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:15:17] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:28] (03CR) 10Volans: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:16:15] (03CR) 10Volans: [C: 03+2] added prefix 91.198.174.224/27, adapt INCLUDE [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [16:16:16] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Wikimedia-GitHub: Deployment of latest Design Style 
Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10hashar) The repository on deploy1001 points at master and does fetch from Ger... [16:16:34] 10Operations, 10Gerrit, 10Wikimedia Design Style Guide, 10Wikimedia-GitHub, and 2 others: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10hashar) [16:18:04] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10Aklapper) [16:18:35] 10Operations, 10LDAP-Access-Requests: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10KFrancis) 05Open→03Resolved a:05KFrancis→03lilients_WMDE Hi All, the NDA is complete. Thanks! [16:19:09] !log Restarting CI Jenkins [16:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:01] (03PS3) 10Volans: dns: consolidate reverse zone files (part 1) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) [16:22:03] (03PS1) 10Volans: dns: consolidate reverse zone files (part 2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) [16:22:21] (03PS1) 10Volans: netbox: move $INCLUDEs to the consolidated files [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) [16:23:04] (03CR) 10jerkins-bot: [V: 04-1] netbox: move $INCLUDEs to the consolidated files [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [16:23:50] (03CR) 10Volans: "Expected CI failure because it depends on the merge and deploy of the depends-on change." 
[dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [16:25:19] !log rebooting cloudvirt1023 - trying PXE boot [16:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:42] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10herron) Actually, after some further manual testing I think we have a reason: 521 5.7.1 Service unavailable; client [208.8... [16:32:14] (03CR) 10Volans: "This is the new diff https://phabricator.wikimedia.org/P12939" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [16:33:35] (03CR) 10Volans: "Diff is https://phabricator.wikimedia.org/P12954" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [16:42:08] 10Operations, 10LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10jbolorinos-ctr) I think my wikitech username is jbol (is this the login for gerrit?) 
[16:42:17] (03PS1) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) [16:44:35] (03PS2) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) [16:48:48] (03PS1) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) [16:50:27] (03PS3) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) [16:50:29] (03PS2) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) [16:55:13] (03CR) 10CRusnov: [C: 03+1] "I think that this looks good to me. It should be a harmless minimal change as we discussed and the code looks fine." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [16:58:59] (03CR) 10Bstorm: [C: 03+2] "One way to find out!" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [16:59:58] (03Merged) 10jenkins-bot: locales: switch to using locales-all package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [17:00:05] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1700). 
[17:12:35] (03CR) 10CRusnov: [C: 03+1] "LGTMa" [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [17:14:01] (03CR) 10Dzahn: "thanks for merging :))" [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [17:14:45] (03CR) 10Dbarratt: [C: 03+1] Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders) [17:16:27] (03CR) 10CRusnov: [C: 03+1] "Looks good" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [17:16:32] !log install prometheus-rsyslog-exporter_0.0.0+git20201008 on centrallog1001 - T210137 [17:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:37] T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 [17:23:55] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [17:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:35] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10Urbanecm) Thanks @herron. I guess we should investigate why the reason doesn't appear in our local logs. Should I open a fo... 
[17:31:54] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:30] 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services): "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (10colewhite) Indeed, there is a bit of delay due to retries and... [17:32:55] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10Urbanecm) [17:33:29] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10Urbanecm) [17:34:05] (03PS6) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) [17:34:32] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [17:35:44] (03CR) 10Razzi: [C: 03+2] oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [17:37:10] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@945e5c1]: airflow: Set search satisfaction dag start date to oldest current available data [17:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:19] !log root@cumin1001 START - Cookbook sre.dns.netbox [17:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:40] that was me, double sudo by mistake [17:46:38] (03PS1) 10Volans: sre.dns.netbox: improve 
user message [cookbooks] - 10https://gerrit.wikimedia.org/r/632971 (https://phabricator.wikimedia.org/T264846) [17:46:57] (03PS4) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) [17:46:59] (03PS3) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) [17:47:01] (03PS1) 10Andrew Bogott: cloudvirt1023: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/632972 (https://phabricator.wikimedia.org/T259399) [17:48:07] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1023: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/632972 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [17:48:59] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: improve user message [cookbooks] - 10https://gerrit.wikimedia.org/r/632971 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [17:49:05] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@945e5c1]: airflow: Set search satisfaction dag start date to oldest current available data (duration: 11m 55s) [17:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:10] !log root@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:50:11] (03Merged) 10jenkins-bot: sre.dns.netbox: improve user message [cookbooks] - 10https://gerrit.wikimedia.org/r/632971 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans) [17:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:15] (03PS8) 10Dzahn: thumbor: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630694 [17:57:45] (03CR) 10Dzahn: [C: 03+2] "disabling puppet on thumbor* via cumin, applying on one host to confirm noop, then enable on others again.. 
i think this is what Effie mea" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [17:58:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25787/thumbor1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [17:59:10] (03CR) 10Dzahn: "wmf-style: total violations delta -6 ( -7, +1)" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1800). [18:00:04] Tchanders: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:29] \o/ [18:00:34] Tchanders: wanna self-service? ;) [18:01:33] Urbanecm: I'm set up to deploy this - JamesF agreed to help this time too! [18:01:36] Urbanecm: Yeah, Tchanders and I will deal. [18:01:39] Snap. [18:01:50] *scap [18:01:54] :P [18:02:38] (03CR) 10Dzahn: "confirmed noop on thumbor1001, thumbor2001,... 
re-enabling puppet on thumbor*" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [18:02:39] Thanks Urbanecm :) [18:03:23] (03PS1) 10Razzi: Revert "oozie: use admin groups to determine admin access" [puppet] - 10https://gerrit.wikimedia.org/r/632823 [18:04:15] (03CR) 10Tchanders: [C: 03+2] Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders) [18:04:54] \o/ [18:06:34] (03Merged) 10jenkins-bot: Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders) [18:06:36] (03CR) 10Razzi: [C: 03+2] Revert "oozie: use admin groups to determine admin access" [puppet] - 10https://gerrit.wikimedia.org/r/632823 (owner: 10Razzi) [18:17:00] !log tchanders@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632908|Enable Special:Investigate by default on production (T264357)]] (duration: 01m 06s) [18:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:06] T264357: Deploy Special:Investigate to all wikis - https://phabricator.wikimedia.org/T264357 [18:17:35] (03CR) 10Dzahn: [C: 03+2] "turns out this is not currently used but should not be deleted quite yet" [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn) [18:18:57] (03PS2) 10Dzahn: elasticsearch::cirrus: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/632567 [18:28:53] (03PS3) 10CRusnov: diffscan.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/630703 (https://phabricator.wikimedia.org/T247364) [18:31:25] 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10elukey) [18:31:59] (03CR) 10Dzahn: [C: 03+2] "noop: https://puppet-compiler.wmflabs.org/compiler1001/25790/" [puppet] - 
10https://gerrit.wikimedia.org/r/632567 (owner: 10Dzahn) [18:34:26] (03CR) 10Dzahn: "confirmed complete noop on elastic1032, elastic2055, relforge1002, .." [puppet] - 10https://gerrit.wikimedia.org/r/632567 (owner: 10Dzahn) [18:34:40] (03PS5) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) [18:34:42] (03PS4) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) [18:34:44] (03PS1) 10Andrew Bogott: wmcs backy2: allow hiera config of when the backup runs [puppet] - 10https://gerrit.wikimedia.org/r/632976 (https://phabricator.wikimedia.org/T260692) [18:39:03] (03PS2) 10Dzahn: hadoop::monitoring: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631302 [18:39:59] (03CR) 10Dzahn: [C: 03+2] "already compiled on master, client and worker, will confirm anyways" [puppet] - 10https://gerrit.wikimedia.org/r/631302 (owner: 10Dzahn) [18:47:38] (03CR) 10Razzi: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [18:49:00] (03CR) 10Razzi: "This is part 1 of 3:" [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [18:50:58] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 52 threshold =0.15 breach: active_primary_shards: 52, task_max_waiting_in_queue_millis: 608, unassigned_shards: 46, active_shards: 52, initializing_shards: 6, number_of_pending_tasks: 9, relocating_shards: 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0, number_of_nodes: 2, active_shards_percent_a [18:50:58] tatus: red, cluster_name: relforge-eqiad, timed_out: False, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:52:40] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: unassigned_shards: 0, timed_out: False, number_of_pending_tasks: 0, number_of_nodes: 2, number_of_data_nodes: 2, status: green, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, relocating_shards: 0, number_of_in_flight_fetch: 0, active_shards: 104, cluster_name: relforge- [18:52:40] rds_percent_as_number: 100.0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:57:42] !log volker-e@deploy1001 Started deploy [design/style-guide@b1166af]: Deploy design/style-guide: [18:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:49] !log volker-e@deploy1001 Finished deploy [design/style-guide@b1166af]: Deploy design/style-guide: (duration: 00m 06s) [18:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:57] 10Operations, 10Gerrit, 10Wikimedia Design Style Guide, 10Wikimedia-GitHub, and 2 others: Deployment of latest Design Style 
Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10Volker_E) 05Open→03Invalid There was a Git misconfiguration locally. Has worked now. Sor... [19:00:04] hashar and marxarelli: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1900). [19:03:36] (03CR) 10Dzahn: "confirmed noop on an-worker1087, an-master1001, flerovium, analytics1047, alerts1001" [puppet] - 10https://gerrit.wikimedia.org/r/631302 (owner: 10Dzahn) [19:03:41] (03PS1) 10Bstorm: locales: the update-locale command is in the locales package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632982 [19:04:37] (03CR) 10Bstorm: [C: 03+2] locales: the update-locale command is in the locales package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632982 (owner: 10Bstorm) [19:05:06] (03Merged) 10jenkins-bot: locales: the update-locale command is in the locales package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632982 (owner: 10Bstorm) [19:11:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs backy2: allow hiera config of when the backup runs [puppet] - 10https://gerrit.wikimedia.org/r/632976 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [19:21:04] (03PS4) 10Dzahn: zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 [19:23:44] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@3b11443]: search_satisfaction: Alias sample multiplier to expected name [19:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:01] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10Ladsgroup) [19:24:53] !log 
ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@3b11443]: search_satisfaction: Alias sample multiplier to expected name (duration: 01m 09s) [19:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:27] (03CR) 10BryanDavis: wmcs server backups: Add a way to assign projects to backup hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [19:31:28] (03CR) 10Dzahn: [V: 03+1] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn) [19:32:02] (03CR) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [19:34:25] (03PS3) 10Esanders: Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349) [19:35:49] 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10Dzahn) Using the same key should be fine. But we will need a new "expiry_date" please. And should we use expiry_contact: nruiz@ like before? [19:36:02] (03PS6) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) [19:36:04] (03PS5) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) [19:38:49] (03CR) 10Dzahn: [V: 03+1 C: 03+2] zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn) [19:39:25] (03CR) 10Bstorm: [C: 03+1] "Looks good like this, I think." 
[puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [19:40:09] (03CR) 10Andrew Bogott: [C: 03+2] wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [19:41:26] (03CR) 10Dzahn: "noop also confirmed on the same hosts after merging" [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn) [19:42:04] (03PS3) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315 [19:49:26] (03CR) 10Andrew Bogott: [C: 03+2] wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott) [19:49:42] (03PS6) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) [19:52:41] (03CR) 10Dzahn: "WMCS team: are you ok if we go forward with this? This may look large but it's the same hiera/lookup replacement we did before and nothing" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn) [19:55:43] (03PS2) 10Dzahn: dumps/homer: turn bash scripts into sh scripts [puppet] - 10https://gerrit.wikimedia.org/r/631892 (https://phabricator.wikimedia.org/T95064) [19:58:36] (03CR) 10Dzahn: [C: 03+2] "thanks for the reviews, reducing this patch to just change dumps and homer script and and merging that" [puppet] - 10https://gerrit.wikimedia.org/r/631892 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [20:04:24] (03CR) 10Dzahn: "Andrew/Bryan you think this monitoring can be removed at this point?" 
[puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:07:45] 10Operations, 10ops-eqiad, 10DC-Ops: rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [20:08:31] 10Operations, 10ops-eqiad, 10DC-Ops: rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [20:09:14] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [20:09:59] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:29] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25796/scb2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631898 (owner: 10Dzahn) [20:11:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:14] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10herron) wiki-mail-codfw.wikimedia.org has been delisted. This should resolve the issue outlined in the description here. In terms of foll... [20:13:29] (03CR) 10Dzahn: "simply replacing hiera() with lookup() when there are no default values and other changes has never been any difference anywhere so far. 
n" [puppet] - 10https://gerrit.wikimedia.org/r/631898 (owner: 10Dzahn) [20:14:33] (03PS2) 10Dzahn: openstack: turn bash scripts without bashisms into sh scripts [puppet] - 10https://gerrit.wikimedia.org/r/631891 (https://phabricator.wikimedia.org/T95064) [20:15:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [20:16:11] (03PS11) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 [20:17:46] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen) [20:20:15] (03CR) 10Dzahn: "> Patch Set 7: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [20:23:23] _eyes ORES eqiad suspiciously_ [20:23:28] don't you do it [20:23:41] (03CR) 10BryanDavis: toolforge/dynamicproxy: remove diamond monitoring proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:30:03] (03CR) 10Dzahn: toolforge/dynamicproxy: remove diamond monitoring proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:31:18] (03PS2) 10Dzahn: toolforge/dynamicproxy: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) [20:32:23] (03CR) 10jerkins-bot: [V: 04-1] toolforge/dynamicproxy: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:33:09] (03PS3) 10Dzahn: toolforge: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) [20:33:49] (03CR) 10Dzahn: "renamed to reflect what it actually is now. 
just removes the profile that absented a diamond collector" [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:35:51] (03PS1) 10Hashar: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632991 (https://phabricator.wikimedia.org/T264793) [20:38:07] (03CR) 10Dzahn: "Jaime, Alex, i assume we are keeping this for a bit longer just in case?" [puppet] - 10https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [20:40:32] (03CR) 10BryanDavis: [C: 03+1] toolforge: delete profile::toolforge::services::basic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:42:06] (03CR) 10Dzahn: [C: 03+1] "per "already cherry-picked" 😊" [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [20:43:03] (03PS3) 10Dzahn: Delete puppet role and module for Phragile [puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) (owner: 10Aklapper) [20:43:25] (03CR) 10Dzahn: [C: 03+2] "thanks for the cleanup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) (owner: 10Aklapper) [20:43:34] !log deploying Netbox DNS zone consolidation - T264273 [20:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:40] T264273: DNS: per prefix zone-file limitation - https://phabricator.wikimedia.org/T264273 [20:43:49] (03CR) 10Volans: [C: 03+2] dns: consolidate reverse zone files (part 1) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [20:44:43] (03CR) 10Dzahn: "yep, whitespace issue fixed" [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:45:23] !log volans@cumin1001 START - Cookbook sre.dns.netbox [20:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:21] (03CR) 10Hashar: [C: 03+2] "Deploying this!" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632991 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar) [20:48:39] (03CR) 10Dzahn: [C: 03+2] toolforge: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:48:46] (03PS4) 10Dzahn: toolforge: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) [20:50:49] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:50] (03CR) 10Dzahn: [C: 03+2] toolforge: delete profile::toolforge::services::basic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [20:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:56] (03CR) 10Volans: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 
10Volans) [20:55:05] (03CR) 10Volans: [C: 03+2] netbox: move $INCLUDEs to the consolidated files [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [20:58:21] (03PS2) 10Volans: dns: consolidate reverse zone files (part 2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) [20:59:25] (03CR) 10Dzahn: [C: 03+1] "lgtm, already uses the same hdfs commands before, just drops a step" [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [20:59:31] (03CR) 10Volans: [C: 03+2] dns: consolidate reverse zone files (part 2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans) [21:00:23] !log volans@cumin1001 START - Cookbook sre.dns.netbox [21:00:25] !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [21:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:35] !log volans@cumin1001 START - Cookbook sre.dns.netbox [21:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:38] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:32] (03Merged) 10jenkins-bot: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632991 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar) [21:11:11] (03PS1) 10Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s [puppet] - 10https://gerrit.wikimedia.org/r/633017 [21:11:35] deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/632991 [21:14:04] (03PS1) 10Dzahn: parsoid: replace a hiera call with lookup [puppet] - 
10https://gerrit.wikimedia.org/r/633020 [21:14:49] (03PS1) 10Hashar: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) [21:15:02] (03CR) 10Hashar: [C: 03+2] "deploying" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar) [21:16:47] (03PS1) 10Dzahn: elasticsearch: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/633022 [21:22:20] (03CR) 10Dzahn: "pretty much guaranteed noop, especially where there were no default values..which is all except one or so" [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn) [21:22:38] (03PS2) 10Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s [puppet] - 10https://gerrit.wikimedia.org/r/633017 [21:22:42] (03PS1) 10Dzahn: cumin: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633024 [21:25:12] (03CR) 10Dzahn: "also..removing the lint-ignore that apparently is not needed.. puppet-lint accepts it" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn) [21:28:39] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32378655} Suspiciously similar just happened to ores on eqiad. I reset the uwsgi service on all ores100x boxes and will monitor. [21:29:29] (03CR) 10Volans: [C: 03+1] "LGTM if the compiler is happy :)" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn) [21:31:21] (03PS1) 10Dzahn: pybal: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633026 [21:34:48] (03CR) 10Dzahn: [C: 03+2] "thanks! 
it is: cumin::master: https://puppet-compiler.wmflabs.org/compiler1002/25798/cumin2001.codfw.wmnet/index.html cumin::target: http" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn) [21:34:56] (03PS2) 10Dzahn: cumin: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633024 [21:37:20] (03Merged) 10jenkins-bot: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar) [21:37:57] 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10Dzahn) > in addition to having different message contents are also sent outward via the main mx host interface instead of the wiki-mail-si... [21:40:27] (03CR) 10Dzahn: "noop on cumin1001/2001 and various targets" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn) [21:41:26] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be106[0-3] - https://phabricator.wikimedia.org/T265093 (10RobH) [21:41:37] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be106[0-3] - https://phabricator.wikimedia.org/T265093 (10RobH) [21:43:43] deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/632992 [21:45:13] (03PS1) 10Dzahn: local_dev::docker_publish: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/633027 [21:46:11] 08Warning Alert for device mr1-eqsin.wikimedia.org - Processor usage over 85% [21:46:33] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10LGoto) [21:51:22] 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - 
https://phabricator.wikimedia.org/T265071 (10Nuria) Expiry contact will be @Ottomata end data is April 1 2021 [21:52:21] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.10/includes/session/SessionBackend.php: Deduplicate SessionBackend::logPersistenceChange calls - T264793 (duration: 01m 01s) [21:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:29] T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793 [21:52:33] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10JMinor) We are publishing v6.7.2 (1780) of the app as I write, which has our... [21:52:58] (03CR) 10Hashar: "Deployed" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar) [21:53:16] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10JMinor) a:05Dmantena→03None [21:53:39] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a923949]: search_satisfaction: update druid datasource to match previous data [21:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:25] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) Really great to see this happen so quickly! Thanks so much :) I'll... 
[21:54:44] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a923949]: search_satisfaction: update druid datasource to match previous data (duration: 01m 04s) [21:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:31] (03PS1) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) [21:59:49] that deduplication patch seems to have worked properly \o/ [22:03:38] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:38] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for [22:03:38] timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [22:03:58] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:14] PROBLEM - restbase endpoints health on 
restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:24] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:34] just like last time and right after deploy [22:04:35] right [22:04:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:05:56] mutante: the deploy is a red herring, the cause turned out to be T264881 [22:05:56] T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 [22:06:12] rzl: oh, ok. thank you [22:06:15] (which the iOS devs have done a really nice job jumping on and fixing, we're just waiting for the deployment) [22:06:27] *nod* cool [22:06:37] PROBLEM - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. 
termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.47 and port 4101: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:06:54] here [22:07:00] ^ I'll ack that, it's the same, no action needed [22:07:11] cdanis: we are just waiting for a fix to be deployed ^ [22:07:14] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:16] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:16] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:20] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:24] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [22:07:33] mutante: sorry, are you referring to the iOS app? 
[22:07:37] cdanis: yes [22:07:39] yeah [22:07:48] realized as soon as I looked what time it was :) [22:08:14] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:22] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:08:27] the "restbase-dev" is what it starts with and gave it away [22:08:28] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:32] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [22:08:32] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:40] * volans with one foot in the bed already, need any help? 
[22:08:44] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:49] volans: nope, sleep well [22:08:50] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:52] volans: no, this is the usual 22:00 UTC thing [22:09:04] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:07] "usual" :/ [22:09:26] volans: diagnosed and soon fixed, at least! [22:09:34] ack thx [22:09:50] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:00] rescheduled checks to make it faster [22:10:04] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:10:07] RECOVERY - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. 
termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1006 bytes in 3.718 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:10:12] wow this one was a bit more percussive than usual -- check out appserver latency [22:10:30] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles f [22:10:30] 6) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [22:10:32] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:32] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:10:32] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:32] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:10:33] ? 
[22:10:34] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:34] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:34] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:34] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:34] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:34] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:34] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:38] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:38] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:38] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:41] bblack: T264881 [22:10:51] ok [22:11:12] the live icinga spam here on IRC seemed "different", like, much faster [22:11:20] yeah, this was more severe than usual [22:11:32] did we fix/tweak something about the irc echoing bits? 
[22:11:34] i sped up the recovery a bit [22:11:37] it's not clear to me why these requests don't get coalesced more [22:11:37] ok [22:11:44] they are cacheable, at least the ones I've seen [22:12:04] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:12:06] hm, restbase error rate is still fairly high [22:12:11] 08̶W̶a̶r̶n̶i̶n̶g Device mr1-eqsin.wikimedia.org recovered from Processor usage over 85% [22:12:16] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:15:25] cdanis: looks okay now, just recovered slowly for some reason [22:15:32] yeah [22:15:48] looks like it did the same yesterday too [22:15:53] (at a lower error rate) [22:16:48] actually, that was the *request* rate falling off [22:16:49] https://grafana.wikimedia.org/d/000000068/restbase?viewPanel=15&orgId=1&from=now-1h&to=now [22:17:13] yeah, the page summary API does get hit by the mobile apps AIUI [22:17:24] nod [22:17:43] sharp peak for "v1_page_random_-format-" and then a longer falloff for "v1_page_summary_-title-" [22:18:53] I'm not sure what "random" is doing in there, is that the endpoint for the featured page somehow? it's overwhelmingly where the 5xxs are [22:19:00] rzl: https://w.wiki/fsZ [22:19:02] no [22:19:11] it is literally the 'random wiki page' handler [22:19:27] also being removed https://phabricator.wikimedia.org/T264881#6525670 [22:19:49] yeah that's why I'm surprised it spikes at midn-- ohhh okay [22:19:52] OH [22:19:54] OH [22:19:57] *OH* [22:20:03] 💡? [22:20:06] do you know what is special about that endpoint?! 
[22:20:11] it's not cacheable, is it [22:20:12] IT ISN'T CACHEABLE [22:20:42] do you know what else is special about it, though [22:21:07] when we get a spike of requests over the ordinary rate, nobody will see the responses or care if they're 429s [22:21:19] until after this iOS deploy goes out anyway [22:21:50] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10eprodromou) [22:21:55] I guess that's not true, even if we restrict by user agent we'll probably still rate limit *some* actual humans who just happen to be bored at 22:02 UTC [22:22:13] I think already almost no one cares during these spikes -- Tsevener said that the widgets (which cause the synchronized traffic) don't need the random page, they just happen to call code that also fetches it [22:22:26] right yeah exactly [22:22:51] what I mean is we can rate-limit this at varnish without breaking even the feature that relies on the background fetch [22:23:02] yeah, exactly [22:23:08] we wouldn't want to rate limit the "get featured page" but we don't have to, that's cacheable [22:23:10] we don't usually ratelimit not-per-IP but I think it's reasonable [22:24:10] ohh, I see where we got off the same page [22:24:19] by "until after this iOS deploy" I just meant, we won't need the rate limit after that [22:24:28] right, I realized after [22:24:36] temporary fix until the permanent one is out in the field [22:24:37] 👍 [22:32:30] (03CR) 10Jeena Huneidi: "This looks fine to me...but I don't know much about puppet." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn) [22:35:05] (03PS4) 10Razzi: geoip: archive MaxMind database to hdfs only [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) [22:35:07] (03PS1) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) [22:35:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:20] (03CR) 10jerkins-bot: [V: 04-1] geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [22:36:31] (03CR) 10Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn) [22:36:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:59] (03PS1) 10Dzahn: calico: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633033 [22:37:15] 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) When the 22:00 traffic spike happened today, it was a bit more impac... [22:54:56] !log About to start plugin upgrade followed by restarts of `cloudelastic`. 
Maintenance window set for the next 2 hours on `cloudelastic100[1-6]` [22:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:08] !log `sudo -E cumin -b 6 C:role::elasticsearch::cloudelastic 'DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install wmf-elasticsearch-search-plugins'` [22:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:34] !log `sudo apt policy wmf-elasticsearch-search-plugins` shows correct state: `Installed: 6.5.4-4~stretch` [22:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T2300) [23:00:04] tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:42] !log Writes are frozen for `cloudelastic`: `/usr/local/bin/mwscript extensions/CirrusSearch/maintenance/FreezeWritesToCluster.php --wiki=enwiki --cluster=cloudelastic` on `mwmaint2001` => `Applied cluster-wide freeze` [23:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:49] (03CR) 10Gergő Tisza: [C: 03+2] Enable logging of session cookie changes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:04:27] (03PS2) 10Gergő Tisza: Enable logging of session cookie changes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) [23:04:56] !log Beginning cluster restarts one server at a time. 
For each server, the process is depool->restart elasticsearch services->wait for services to restart and then pool->wait for cluster to return to green status before starting next server [23:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:34] (03CR) 10Gergő Tisza: [C: 03+2] Enable logging of session cookie changes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:09:23] (03Merged) 10jenkins-bot: Enable logging of session cookie changes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) (owner: 10Gergő Tisza) [23:16:21] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632797|Enable logging of session cookie changes everywhere (T264793)]] (duration: 01m 01s) [23:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:27] T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793 [23:16:48] !log `cloudelastic1001` is done restarting and cluster is green again. Proceeding to `cloudelastic1002` [23:16:51] !log Evening deploys done [23:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:56] (03CR) 10Bstorm: [C: 03+2] "I'm feeling feisty and think I'll merge this." [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [23:18:57] (03CR) 10Bstorm: [C: 03+1] "Oh wait, never mind. This has parent patches. 
Moving back to +1 until that's sorted 😜" [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [23:23:52] !log `cloudelastic1002` done [23:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:40] !log `cloudelastic1003` done [23:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:01] !log `cloudelastic1004` done [23:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:03] (03PS1) 10Legoktm: [WIP] Add buildpack base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/633036 [23:33:53] (03PS2) 10Legoktm: [WIP] Add buildpack base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/633036 [23:37:24] !log `cloudelastic1005` done [23:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:34] !log `cloudelastic1006` done. Writes thawed, maintenance window lifted; restarts are done for `cloudelastic` [23:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:18] (03CR) 10Bstorm: "I'm getting an error related to drbd resources https://puppet-compiler.wmflabs.org/compiler1001/25800/" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [23:45:29] (03CR) 10Bstorm: labstore: add data types and some other style fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [23:52:25] (03PS1) 10Dzahn: site: add mediawiki appserver role to a test VM [puppet] - 10https://gerrit.wikimedia.org/r/633038 [23:53:53] (03CR) 10Dzahn: [C: 03+2] site: add mediawiki appserver role to a test VM [puppet] - 10https://gerrit.wikimedia.org/r/633038 (owner: 10Dzahn) [23:57:33] (03PS1) 10Dzahn: Revert "site: add mediawiki appserver role to a test VM" [puppet] - 10https://gerrit.wikimedia.org/r/632997 [23:58:15] (03CR) 10Dzahn: [C: 03+2] Revert "site: add mediawiki appserver role to a test VM" [puppet] - 
10https://gerrit.wikimedia.org/r/632997 (owner: 10Dzahn)