[00:00:04] <jouncebot>	 twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T0000).
[00:02:44] <tgr_>	 I'm running over the backport window but I imagine that won't interfere with the Phab one
[00:03:07] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632795|Enable logging of session cookie changes in group0 (T264793)]] (duration: 00m 58s)
[00:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:00] <wikibugs>	 (CR) Gergő Tisza: [C: +2] Enable logging of session cookie changes in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/632796 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:05:42] <wikibugs>	 (Merged) jenkins-bot: Enable logging of session cookie changes in group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/632796 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:10:36] <wikibugs>	 (Merged) jenkins-bot: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/632806 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:12:13] <wikibugs>	 (Merged) jenkins-bot: Log when SessionManager is emitting cookies [core] (wmf/1.36.0-wmf.12) - https://gerrit.wikimedia.org/r/632807 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[00:15:08] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632796|Enable logging of session cookie changes in group1 (T264793)]] (duration: 00m 57s)
[00:15:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:14] <stashbot>	 T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793
[00:20:51] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632796|Enable logging of session cookie changes in group1 (T264793)]] (again, forgot to rebase the previous time) (duration: 00m 59s)
[00:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:57] <stashbot>	 T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793
[00:31:48] <tgr_>	 !log evening deploys done
[00:31:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35065024 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:49] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:57] <tgr_>	 feel free to revert the last config patch if it is causing too much log traffic.
[01:03:18] <wikibugs>	 Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Ladsgroup) I don't know if this has been considered or not and I admit I don...
[01:10:23] <wikibugs>	 Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) >>! In T264881#6527486, @Ladsgroup wrote: > I don't know if this has...
[01:50:11] <wikibugs>	 Operations, Release-Engineering-Team, Wikimedia Design Style Guide: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (Dzahn) I checked the git status on both backends, miscweb1002 and miscweb2002 and they are both at commit e3fda...
[01:51:41] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Wikimedia-GitHub: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (Dzahn)
[02:02:01] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Wikimedia-GitHub: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (Dzahn) The `deploy-style-guide.sh` script git pulls from https://gerrit.wikim...
[03:19:23] <wikibugs>	 (CR) Hazard-SJ: [C: +1] Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[03:22:54] <wikibugs>	 (CR) Hazard-SJ: [C: -1] "It seems that the desired approach has changed in T258354: instead of creating a new group, the current discussions steers towards removin" [mediawiki-config] - https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: Tobias Andersson)
[04:35:11] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:36:53] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 72.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:26:02] <wikibugs>	 (PS1) Marostegui: mariadb: Remove puppet entries for es2015 [puppet] - https://gerrit.wikimedia.org/r/632835 (https://phabricator.wikimedia.org/T264700)
[05:26:54] <wikibugs>	 (PS1) Marostegui: dns: Remove es2015 entries [dns] - https://gerrit.wikimedia.org/r/632836 (https://phabricator.wikimedia.org/T264700)
[05:27:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:31] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove puppet entries for es2015 [puppet] - https://gerrit.wikimedia.org/r/632835 (https://phabricator.wikimedia.org/T264700) (owner: Marostegui)
[05:33:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:07] <wikibugs>	 (CR) Marostegui: [C: +2] dns: Remove es2015 entries [dns] - https://gerrit.wikimedia.org/r/632836 (https://phabricator.wikimedia.org/T264700) (owner: Marostegui)
[05:35:17] <wikibugs>	 (PS1) Ladsgroup: mailman: Set default charset in mailman2 configs [puppet] - https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031)
[05:35:41] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700 (Marostegui)
[05:37:37] <wikibugs>	 (CR) Ladsgroup: "I have some confidence that this one would work, if it's working, we can remove the apache hack." [puppet] - https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031) (owner: Ladsgroup)
[05:48:46] <wikibugs>	 Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[06:09:51] <wikibugs>	 (PS2) Giuseppe Lavagetto: restbase: remove monitoring calls to the http endpoint [puppet] - https://gerrit.wikimedia.org/r/632708
[06:15:35] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2020-10-08-053343-production [deployment-charts] - https://gerrit.wikimedia.org/r/632838 (https://phabricator.wikimedia.org/T264407)
[06:20:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] restbase: remove monitoring calls to the http endpoint [puppet] - https://gerrit.wikimedia.org/r/632708 (owner: Giuseppe Lavagetto)
[06:40:29] <wikibugs>	 (PS1) Elukey: Move the HDFS balancer to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/632877
[06:45:37] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25773/" [puppet] - https://gerrit.wikimedia.org/r/632877 (owner: Elukey)
[06:45:41] <wikibugs>	 (CR) Elukey: [C: +2] Move the HDFS balancer to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/632877 (owner: Elukey)
[06:46:19] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[06:46:37] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I've added a dropdown to pick the percentile on https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host  Here's what it looks...
[06:47:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P
[06:48:13] <dcausse>	 looking ^
[06:49:24] <elukey>	 dcausse: lemme know if you need help
[06:49:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:49:59] <_joe_>	 I was about to ask :)
[06:50:32] <_joe_>	 why  wdqs-ssl-codfw has notifications disabled?
[06:50:35] <_joe_>	 ffs.
[06:51:00] <dcausse>	 hmm they recovered themselves... graph shows a spike in load
[06:51:19] <_joe_>	 !log enable notifications for wdqs-ssl-codfw
[06:51:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:36] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[06:53:01] <dcausse>	 it's wdqs2002 that disappeared since yester 19h and we did not notice :/
[06:53:49] <_joe_>	 dcausse: disappeared?
[06:54:01] <_joe_>	 ok who stole the server?
[06:54:05] <_joe_>	 :P
[06:54:29] <dcausse>	 :)
[06:55:20] <dcausse>	 blazegraph deadlock I suppose, looking, it's no longer reporting any metrics
[06:57:08] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:57:20] <wikibugs>	 Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[06:57:45] <dcausse>	 !log restart blazegraph on wdqs2002 (stuck) T242453 
[06:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:51] <stashbot>	 T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453
[06:58:16] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:58:34] <_joe_>	 thanks dcausse
[07:00:22] <dcausse>	 !log depooling wdqs2002 (catching-up lag)
[07:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:11] <wikibugs>	 (PS1) Elukey: Remove an-worker1043 from the Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411)
[07:06:49] <wikibugs>	 (PS2) Elukey: Remove analytics1043 from the Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411)
[07:12:07] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25775/" [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[07:12:10] <wikibugs>	 (CR) Elukey: [C: +2] Remove analytics1043 from the Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/632878 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[07:23:01] <moritzm>	 !log installing pyzmq updates from Buster point release
[07:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:17] <jayme>	 !log updated envoyproxy to 1.15.1-2 on all codfw hosts
[07:29:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:09] <gehel>	 !log depooled wdqs2002 to catch up on lag
[07:40:12] <gehel>	 ryankemper: ^
[07:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:18] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (Gehel) Increasing the number of shards for commons wiki is starting to be an issue. We need a better strategy.
[07:44:57] <wikibugs>	 (PS7) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[07:45:26] <marostegui>	 !log Stop MySQL on db1077 to build it from s1 snapshot
[07:45:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:50] <wikibugs>	 Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (akosiaris) Overall, I am willing to test this out, couples of points though:  * Since it's recommended by various standards to do the default DROP thing, w...
[07:47:37] <wikibugs>	 (PS8) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[07:50:53] <wikibugs>	 Operations, serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (Joe)
[07:51:03] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` wdqs1009.eqiad.wmnet ` The log can be found in `/var/log/w...
[07:53:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:54:23] <dcausse>	 these servers are overloaded without wdqs2002 I think ^
[07:55:19] <marostegui>	 !log Rebuild db2125 from snapshots - T260670
[07:55:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:25] <stashbot>	 T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:02:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:02:57] <gehel>	 !log repooling wdqs2002
[08:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:16] <gehel>	 ^better to have slightly stale data than crashing all the servers
[08:03:16] <wikibugs>	 (PS9) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[08:03:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:04:56] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.hosts.downtime
[08:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:51] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:03] <wikibugs>	 (PS10) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[08:10:22] <icinga-wm>	 RECOVERY - MD RAID on wdqs1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:12:32] <wikibugs>	 (PS11) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412)
[08:14:47] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Wikidata, and 2 others: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs1009.eqiad.wmnet'] `  and were **ALL** successful.
[08:15:00] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:32] <wikibugs>	 Operations, serviceops: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991 (MoritzMuehlenhoff) I've created a standalone backport of icu63 in the component/icu63. Rebuilding PHP 7.2 with it is a little tricky, since PHP build-depends on libxml2 (for php7.2...
[08:19:04] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:11] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:19:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:47] <kormat>	 !log running schema change against s8 in eqiad T259831
[08:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:53] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[08:23:13] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6524473, @Gilles wrote: > I really don't understand what I did wrong here  It was never my intention to offend you, and g...
[08:28:50] <wikibugs>	 (PS3) Urbanecm: [labs] Remove wmgMonologChannels override [mediawiki-config] - https://gerrit.wikimedia.org/r/631224
[08:33:04] <wikibugs>	 Operations, observability, User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (fgiunchedi)
[08:36:43] <wikibugs>	 (CR) Urbanecm: [C: -2] "Per decision made at T258354#6509213" [mediawiki-config] - https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: Tobias Andersson)
[08:38:33] <godog>	 !log roll-restart swift-object-replicator on ms-be2* - T261633
[08:38:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:39] <stashbot>	 T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[08:40:31] <wikibugs>	 (PS1) Filippo Giunchedi: swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633)
[08:40:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[08:41:49] <wikibugs>	 (PS2) Filippo Giunchedi: swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633)
[08:53:14] <wikibugs>	 (PS2) Kormat: admin: Replace leila with leizi [puppet] - https://gerrit.wikimedia.org/r/632726 (https://phabricator.wikimedia.org/T264472)
[08:55:04] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (Kormat)
[09:02:37] <wikibugs>	 (CR) Kormat: [C: +2] admin: Replace leila with leizi [puppet] - https://gerrit.wikimedia.org/r/632726 (https://phabricator.wikimedia.org/T264472) (owner: Kormat)
[09:03:49] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I don't think that this discussion is appropriate in a public forum. An email thread seems like an ok starting point, and/or a meetin...
[09:08:38] <wikibugs>	 (Abandoned) Elukey: Move oozie server to an-scheduler1001 [puppet] - https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:09:00] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[09:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:58] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:24] <wikibugs>	 (PS2) Elukey: Add analytics data purge for webrequest sequence stats [puppet] - https://gerrit.wikimedia.org/r/632773 (https://phabricator.wikimedia.org/T262826) (owner: Joal)
[09:16:30] <wikibugs>	 (CR) Elukey: [C: +2] Add analytics data purge for webrequest sequence stats [puppet] - https://gerrit.wikimedia.org/r/632773 (https://phabricator.wikimedia.org/T262826) (owner: Joal)
[09:27:32] <wikibugs>	 (PS1) Elukey: role::druid::analytics::worker: enable TLS for conns to mysql [puppet] - https://gerrit.wikimedia.org/r/632896 (https://phabricator.wikimedia.org/T257412)
[09:27:41] <wikibugs>	 (PS4) Jbond: firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888)
[09:27:43] <wikibugs>	 (PS1) Jbond: sretest1001: Enable default reject rule for sretest1001 [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888)
[09:29:06] <wikibugs>	 (CR) Elukey: [C: +2] role::druid::analytics::worker: enable TLS for conns to mysql [puppet] - https://gerrit.wikimedia.org/r/632896 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:29:28] <wikibugs>	 (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[09:34:41] <wikibugs>	 Operations, DBA, Data-Persistence, Blocked-on-schema-change, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (LSobanski)
[09:37:26] <wikibugs>	 (PS1) Elukey: Revert "role::druid::analytics::worker: enable TLS for conns to mysql" [puppet] - https://gerrit.wikimedia.org/r/632818
[09:40:27] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload
[09:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:38] <wikibugs>	 (CR) Elukey: [C: +2] Revert "role::druid::analytics::worker: enable TLS for conns to mysql" [puppet] - https://gerrit.wikimedia.org/r/632818 (owner: Elukey)
[09:45:52] <wikibugs>	 Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) >>! In T264888#6528076, @akosiaris wrote: > Overall, I am willing to test this out, couples of points though: >  > * Since it's recommended by vario...
[09:45:56] <wikibugs>	 (CR) Hashar: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/632687 (owner: Jbond)
[09:48:17] <wikibugs>	 Operations, SRE-swift-storage, User-fgiunchedi: Some object-replicator log lines not making it to centrallog - https://phabricator.wikimedia.org/T264998 (fgiunchedi)
[10:00:04] <jouncebot>	 mvolz: (Dis)respected human, time to deploy Services – Citoid /  Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1000). Please do the needful.
[10:00:35] <wikibugs>	 (PS3) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631783 (owner: PipelineBot)
[10:02:32] <wikibugs>	 Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (jbond) >  This would also mean that a malicious actor could use us to reflect RST packets however the 40b rst packet comes at a cost of a 60b syn This is n...
[10:03:33] <wikibugs>	 (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631783 (owner: PipelineBot)
[10:05:42] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/631783 (owner: PipelineBot)
[10:09:23] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (Kormat) @leila: Your access should now be active. Please let me know if you run into any issues.  I've opened a couple of subtasks to cover cleanup...
[10:14:52] <logmsgbot>	 !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[10:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:08] <wikibugs>	 (CR) Ayounsi: [C: +2] Pmacct add standard BGP community to flows [puppet] - https://gerrit.wikimedia.org/r/632603 (https://phabricator.wikimedia.org/T254332) (owner: Ayounsi)
[10:22:25] <logmsgbot>	 !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:00] <wikibugs>	 (CR) Hnowlan: [C: +2] conftool-data: add new restbase hosts [puppet] - https://gerrit.wikimedia.org/r/632497 (https://phabricator.wikimedia.org/T261512) (owner: Hnowlan)
[10:26:13] <hnowlan>	 !log pooling restbase1028,restbase1029,restbase1030 
[10:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1028.eqiad.wmnet
[10:26:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:30] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1028.eqiad.wmnet
[10:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:38] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1028.eqiad.wmnet
[10:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:36] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1028.eqiad.wmnet
[10:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:56] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1028.eqiad.wmnet
[10:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:07] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1028.eqiad.wmnet
[10:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:53] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) Done! And confirmed with kafkacat, eg: `"comms": "2914:420_2914:1008_2914:2000_2914:3000_14907:4"` As well as no dr...
[10:29:57] <logmsgbot>	 !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:02] <icinga-wm>	 PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:37] <moritzm>	 !log installing Postgres security updates on netboxdb2001
[10:32:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:11] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1029.eqiad.wmnet
[10:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:16] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (Kormat) Open→Resolved a:Kormat @CGlenn: your access is now in place.
[10:34:19] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1029.eqiad.wmnet
[10:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:25] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1029.eqiad.wmnet
[10:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:27] <moritzm>	 !log installing Postgres security updates on netboxdb1001
[10:37:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:20] <icinga-wm>	 PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:41:24] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:44:02] <moritzm>	 XioNoX: on netflow codfw/esams nfacctd is failing with "plugin_buffer_size is too short"
[10:44:19] <XioNoX>	 uh
[10:45:07] <XioNoX>	 er, actually everywhere
[10:45:14] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase,name=restbase1030.eqiad.wmnet
[10:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:20] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-ssl,name=restbase1030.eqiad.wmnet
[10:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:26] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=eqiad,cluster=restbase,service=restbase-backend,name=restbase1030.eqiad.wmnet
[10:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:52:17] <logmsgbot>	 !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime
[10:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:52] <icinga-wm>	 RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:54:14] <logmsgbot>	 !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:43] <wikibugs>	 (CR) Volans: [C: -2] "After some more digging on the generated files this consolidation is a bit too much and we have both prefixes managed/unmanaged via Netbox" [software/netbox-extras] - https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: Volans)
[10:57:39] <wikibugs>	 (PS1) Ayounsi: nfacctd: set plugin_buffer_size [puppet] - https://gerrit.wikimedia.org/r/632902
[10:58:45] <wikibugs>	 (CR) Ayounsi: [C: +2] nfacctd: set plugin_buffer_size [puppet] - https://gerrit.wikimedia.org/r/632902 (owner: Ayounsi)
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1100).
[11:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:04:12] <icinga-wm>	 RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:08] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:45] <wikibugs>	 (PS3) Hnowlan: api-gateway: use TLS for restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630567
[11:51:15] <wikibugs>	 (CR) CDanis: [C: +1] nfacctd: set plugin_buffer_size [puppet] - https://gerrit.wikimedia.org/r/632902 (owner: Ayounsi)
[11:52:42] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (mark) Hi all,  I recommend we limit the conversations on this task to the technical aspects of this particular regression and its investigati...
[11:56:33] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: use TLS for restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630567 (owner: Hnowlan)
[11:57:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:58:43] <wikibugs>	 (Merged) jenkins-bot: api-gateway: use TLS for restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630567 (owner: Hnowlan)
[11:59:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1200)
[12:05:51] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[12:05:51] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[12:05:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:40] <wikibugs>	 (PS2) Alexandros Kosiaris: eventgate, eventstreams: Log with namedlevels [deployment-charts] - https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459)
[12:07:48] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[12:07:48] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[12:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:59] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[12:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:05] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[12:10:09] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[12:10:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:04] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) Awesome! The size of the events has increased by about 25-30%, which is considerable, but I believe sustainable for now. When we sanitize...
[12:14:18] <kart_>	 If no objection, would like to deploy cxserver now.
[12:14:30] <kart_>	 akosiaris: ^ Is it OK?
[12:14:51] <wikibugs>	 (CR) Ayounsi: [C: +1] "Not tested but PCC and code LGTM." [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[12:15:16] <akosiaris>	 kart_: sure, go ahead
[12:16:53] <kart_>	 akosiaris: thanks.
[12:17:09] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) Wow, that's more than expected indeed! If it's an issue down the road we could think of filtering out some communities (for example only...
[12:17:25] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2020-10-08-053343-production [deployment-charts] - https://gerrit.wikimedia.org/r/632838 (https://phabricator.wikimedia.org/T264407) (owner: KartikMistry)
[12:17:34] <wikibugs>	 Operations, DC-Ops, Data-Persistence-Backup, Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (LSobanski)
[12:19:03] <icinga-wm>	 PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 2668 MB (3% inode=88%): /tmp 2668 MB (3% inode=88%): /var/tmp 2668 MB (3% inode=88%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[12:19:57] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2020-10-08-053343-production [deployment-charts] - https://gerrit.wikimedia.org/r/632838 (https://phabricator.wikimedia.org/T264407) (owner: KartikMistry)
[12:21:18] <logmsgbot>	 !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[12:21:18] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: hieradata: labtestvirt2003: refresh network data for cloudgw PoC with latest allocations [puppet] - https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622)
[12:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] hieradata: labtestvirt2003: refresh network data for cloudgw PoC with latest allocations [puppet] - https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622) (owner: Arturo Borrero Gonzalez)
[12:22:27] <wikibugs>	 Operations, CX-cxserver, Product-Infrastructure-Team-Backlog, Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (Mvolz)
[12:24:37] <logmsgbot>	 !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[12:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:51] <icinga-wm>	 PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/12f2ce7ad6ba57a14a22800561b7118b99bf03272bc56ecd4d8d88fadc4d8410/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops
[12:26:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good and PCC seems sane." [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[12:26:28] <wikibugs>	 (CR) Ayounsi: [C: +1] sretest1001: Enable default reject rule for sretest1001 [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[12:26:50] <logmsgbot>	 !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[12:26:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:09] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM, alternatively we could enable this for role::sretest, then we can test with Stretch (1002) and Buster (1001)." [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[12:29:30] <kart_>	 !log Updated cxserver to 2020-10-08-053343-production (T264407, T264859)
[12:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:37] <stashbot>	 T264859: Create Inari Sámi Wikipedia - https://phabricator.wikimedia.org/T264859
[12:29:37] <stashbot>	 T264407: Check Apertium configuration for Serbo-croatian - https://phabricator.wikimedia.org/T264407
[12:35:13] <wikibugs>	 (PS2) Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408)
[12:37:50] <wikibugs>	 (PS1) Tchanders: Enable Special:Investigate by default on production [mediawiki-config] - https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357)
[12:38:37] <icinga-wm>	 PROBLEM - Druid overlord on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[12:38:49] <icinga-wm>	 PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:41:10] <wikibugs>	 (PS3) Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408)
[12:41:19] <icinga-wm>	 PROBLEM - Druid coordinator on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[12:41:47] <wikibugs>	 (CR) Klausman: modules: Add functionality to allow use of 3.8 rocm packages (3 comments) [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[12:43:01] <icinga-wm>	 RECOVERY - Druid coordinator on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[12:43:10] <elukey>	 druid was me sorry, I was testing a setting
[12:43:39] <icinga-wm>	 RECOVERY - Druid overlord on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[12:43:51] <icinga-wm>	 RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:43:54] <kormat>	 elukey: was the setting "alertNow=true"?
[12:44:59] <elukey>	 kormat: nono it was 'lucaYouNeedToSpecifyTheTLSVersionOtherwiseIcannotMakeIt=crazy'
[12:45:05] <elukey>	 difficult one to find
[12:45:29] <kormat>	 hehe
[12:45:29] <icinga-wm>	 RECOVERY - Disk space on sretest1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops
[12:47:50] <wikibugs>	 Operations, LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (Kormat)
[12:48:27] <wikibugs>	 Operations, CX-cxserver, Product-Infrastructure-Team-Backlog, Wikifeeds, and 3 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (akosiaris) >>! In T239459#6504349, @Mvolz wrote: > The hold-up seems to be eventstreams; it act...
[12:49:08] <wikibugs>	 (PS1) Elukey: Enable TLS between Druid clusters and Mariadb on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/632909 (https://phabricator.wikimedia.org/T257412)
[12:49:53] <wikibugs>	 (CR) Elukey: [C: +2] Enable TLS between Druid clusters and Mariadb on an-coord1001 [puppet] - https://gerrit.wikimedia.org/r/632909 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[13:00:04] <jouncebot>	 hashar and marxarelli: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1300).
[13:00:21] <wikibugs>	 (CR) CDanis: [C: +1] swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[13:06:44] <wikibugs>	 (CR) CDanis: [C: +1] firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[13:06:54] <wikibugs>	 (CR) CDanis: [C: +1] sretest1001: Enable default reject rule for sretest1001 [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[13:13:48] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cloudgw: refresh network config for the PoC [puppet] - https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622)
[13:15:09] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: refresh network config for the PoC [puppet] - https://gerrit.wikimedia.org/r/632904 (https://phabricator.wikimedia.org/T263622) (owner: Arturo Borrero Gonzalez)
[13:17:31] <icinga-wm>	 PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/7b8d97ddd0c9717a9d5e21ad5e0b4e6cf55d8cb1a9260de0263e77249fa060c1/mounts/shm is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops
[13:20:58] <wikibugs>	 (CR) Elukey: [C: +1] modules: Add functionality to allow use of 3.8 rocm packages [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[13:22:48] <wikibugs>	 (CR) Elukey: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/25782/an-coord1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[13:23:45] <wikibugs>	 (CR) Klausman: [C: +2] modules: Add functionality to allow use of 3.8 rocm packages [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[13:23:47] <wikibugs>	 (PS4) Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408)
[13:25:32] <wikibugs>	 (CR) Klausman: [C: +2] modules: Add functionality to allow use of 3.8 rocm packages (2 comments) [puppet] - https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[13:28:02] <wikibugs>	 (PS1) Klausman: aptrepo: Include mivisionx package fro rocm again [puppet] - https://gerrit.wikimedia.org/r/632915
[13:29:03] <wikibugs>	 (PS2) Klausman: aptrepo: Include mivisionx package from rocm again [puppet] - https://gerrit.wikimedia.org/r/632915
[13:37:41] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[13:37:41] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[13:37:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:03] <wikibugs>	 Operations, Analytics, Analytics-Kanban, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) After discussing with the team, we think it's fine for now. If we want to add more fields or increase the sampling ratio, then we should i...
[13:38:29] <wikibugs>	 Operations, LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (Kormat) Hi @jbolorinos-ctr,  What is your user name on wikitech? See https://phabricator.wikimedia.org/tag/ldap-access-requests/  Also, we need a WMF staff member as a contact person (...
[13:41:29] <wikibugs>	 (CR) BBlack: firewall: change to default reject instead of drop (2 comments) [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[13:41:38] <wikibugs>	 Operations, Mail, Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (Aklapper) ping - can #Operations please take a look? Thanks.
[13:44:32] <wikibugs>	 Operations, netops, Patch-For-Review, Security, User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (BBlack) FWIW, I am in general a fan of `REJECT` over `DROP`, especially when there's not even a great obscurity argument, as is the case here.  It will be...
[13:48:44] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: write the stack name once to the filesystem [puppet] - https://gerrit.wikimedia.org/r/632918
[13:48:46] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: read stack from stack.file [puppet] - https://gerrit.wikimedia.org/r/632919
[13:48:48] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: configure hiera based on the stack found on the filesystem [puppet] - https://gerrit.wikimedia.org/r/632920
[13:48:50] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: use hiera.output [puppet] - https://gerrit.wikimedia.org/r/632921
[13:49:01] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] swift: bump rsync_timeout [puppet] - https://gerrit.wikimedia.org/r/632891 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi)
[13:53:31] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:37] <wikibugs>	 (CR) Elukey: [C: +1] aptrepo: Include mivisionx package from rocm again [puppet] - https://gerrit.wikimedia.org/r/632915 (owner: Klausman)
[13:53:37] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (CGlenn) Thank you @Kormat !! :)
[14:04:21] <wikibugs>	 (PS3) Klausman: aptrepo: Include mivisionx package from rocm again [puppet] - https://gerrit.wikimedia.org/r/632915
[14:04:58] <wikibugs>	 (PS4) Klausman: aptrepo: Include mivisionx package from rocm again [puppet] - https://gerrit.wikimedia.org/r/632915
[14:05:31] <wikibugs>	 (CR) Klausman: [C: +2] aptrepo: Include mivisionx package from rocm again [puppet] - https://gerrit.wikimedia.org/r/632915 (owner: Klausman)
[14:07:03] <icinga-wm>	 RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:20] <wikibugs>	 (PS1) Jbond: diffscan: add defeat-rst-ratelimit [puppet] - https://gerrit.wikimedia.org/r/632933 (https://phabricator.wikimedia.org/T264888)
[14:17:46] <moritzm>	 !log importing icu 63.1-6+deb10u1~wmf5 to component/icu63 T264991
[14:17:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:52] <stashbot>	 T264991: Ugrade the MediaWiki appservers to debian buster, icu63 - https://phabricator.wikimedia.org/T264991
[14:18:22] <wikibugs>	 Operations, Mail, Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (Kormat) Hi @Xqt, it looks like the provider for the email address you're currently using for gerrit/phabricator had some issues. There's a bunch of errors in the mail log from 2020-10...
[14:18:56] <wikibugs>	 (CR) Ayounsi: [C: +1] "Indeed, good idea!" [puppet] - https://gerrit.wikimedia.org/r/632933 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[14:19:27] <wikibugs>	 (PS1) Hnowlan: map::postgresql_common: make maps-admin chgrp toggle [puppet] - https://gerrit.wikimedia.org/r/632935 (https://phabricator.wikimedia.org/T263726)
[14:21:41] <marostegui>	 !log Set  global innodb_change_buffering = all; on pc2009 T263443
[14:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:47] <stashbot>	 T263443: Evaluate the impact of changing innodb_change_buffering to inserts  - https://phabricator.wikimedia.org/T263443
[14:25:05] <wikibugs>	 Operations, Mail, Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (Kormat) Ok, this is weird. During the same period that mx2001 was unable to deliver mail to you, mx1001 was able to deliver mail just fine. To the same email address.
[14:28:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:29:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:29:33] <wikibugs>	 (PS5) Jbond: firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888)
[14:30:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[14:31:25] <wikibugs>	 (PS1) Jbond: idp-test1001: Enable default reject rule [puppet] - https://gerrit.wikimedia.org/r/632937
[14:31:58] <wikibugs>	 (PS1) Elukey: amd_rocm: replace << operator with list concatenation [puppet] - https://gerrit.wikimedia.org/r/632938
[14:32:02] <wikibugs>	 (CR) Jbond: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[14:32:31] <wikibugs>	 (CR) Klausman: [C: +2] amd_rocm: replace << operator with list concatenation [puppet] - https://gerrit.wikimedia.org/r/632938 (owner: Elukey)
[14:32:44] <wikibugs>	 (Abandoned) Jbond: sretest1001: Enable default reject rule for sretest1001 [puppet] - https://gerrit.wikimedia.org/r/632897 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[14:32:57] <wikibugs>	 (Abandoned) Elukey: amd_rocm: replace << operator with list concatenation [puppet] - https://gerrit.wikimedia.org/r/632938 (owner: Elukey)
[14:35:00] <wikibugs>	 (PS6) Jbond: firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888)
[14:35:43] <wikibugs>	 (CR) Jbond: [C: +2] diffscan: add defeat-rst-ratelimit [puppet] - https://gerrit.wikimedia.org/r/632933 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[14:36:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/632937 (owner: Jbond)
[14:46:30] <wikibugs>	 (PS1) Klausman: amd_rocm: Fix package list [puppet] - https://gerrit.wikimedia.org/r/632943
[14:46:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] amd_rocm: Fix package list [puppet] - https://gerrit.wikimedia.org/r/632943 (owner: Klausman)
[14:49:00] <wikibugs>	 (PS2) Klausman: amd_rocm: Fix package list [puppet] - https://gerrit.wikimedia.org/r/632943
[14:50:18] <wikibugs>	 (PS3) Klausman: amd_rocm: Fix package list [puppet] - https://gerrit.wikimedia.org/r/632943
[14:51:57] <wikibugs>	 (CR) Elukey: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/25784/stat1005.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/632943 (owner: Klausman)
[14:52:12] <wikibugs>	 (CR) Klausman: [C: +2] amd_rocm: Fix package list [puppet] - https://gerrit.wikimedia.org/r/632943 (owner: Klausman)
[14:54:21] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: interfaces: don't bring up interfaces recursively [puppet] - https://gerrit.wikimedia.org/r/632945 (https://phabricator.wikimedia.org/T261724)
[14:55:25] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (bd808)
[14:56:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:57:04] <wikibugs>	 (CR) MSantos: [C: +1] map::postgresql_common: make maps-admin chgrp toggle [puppet] - https://gerrit.wikimedia.org/r/632935 (https://phabricator.wikimedia.org/T263726) (owner: Hnowlan)
[14:57:17] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:57:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: interfaces: don't bring up interfaces recursively [puppet] - https://gerrit.wikimedia.org/r/632945 (https://phabricator.wikimedia.org/T261724) (owner: Arturo Borrero Gonzalez)
[15:02:40] <wikibugs>	 (CR) Jbond: firewall: change to default reject instead of drop (2 comments) [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[15:02:52] <wikibugs>	 (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[15:03:15] <wikibugs>	 (PS2) Jbond: idp-test1001: Enable default reject rule [puppet] - https://gerrit.wikimedia.org/r/632937
[15:06:23] <wikibugs>	 Operations, Technical-blog-posts, Traffic: Blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T264729 (ema) >>! In T264729#6526259, @srodlund wrote: > I made some minor grammar suggestions. Can you accept / reject them  Done, thank you! I chang...
[15:09:20] <wikibugs>	 (CR) BBlack: [C: +1] firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[15:11:14] <wikibugs>	 (CR) Muehlenhoff: [C: +1] firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[15:11:20] <wikibugs>	 (CR) Dzahn: [C: +1] "nothing to lose" [puppet] - https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031) (owner: Ladsgroup)
[15:15:14] <wikibugs>	 (CR) Jbond: [C: +2] firewall: change to default reject instead of drop [puppet] - https://gerrit.wikimedia.org/r/632543 (https://phabricator.wikimedia.org/T264888) (owner: Jbond)
[15:15:18] <wikibugs>	 (CR) Jbond: [C: +2] idp-test1001: Enable default reject rule [puppet] - https://gerrit.wikimedia.org/r/632937 (owner: Jbond)
[15:25:30] <wikibugs>	 (CR) MarcoAurelio: [C: +1] "LGTM; but not yet on https://wikitech.wikimedia.org/wiki/Deployments. Is this happening on Oct. 8?" [mediawiki-config] - https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: Tchanders)
[15:32:00] <wikibugs>	 (CR) Tchanders: "> Patch Set 1: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: Tchanders)
[15:36:29] <icinga-wm>	 PROBLEM - Check systemd state on idp-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:33] <wikibugs>	 (03PS1) 10Jbond: base::firewall: use ferm::rule instead of ferm::conf [puppet] - 10https://gerrit.wikimedia.org/r/632948 (https://phabricator.wikimedia.org/T264888)
[15:42:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] base::firewall: use ferm::rule instead of ferm::conf [puppet] - 10https://gerrit.wikimedia.org/r/632948 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond)
[15:42:32] <jbond42>	 ^^ the idp-test issue is me
[15:44:45] <icinga-wm>	 RECOVERY - Check systemd state on idp-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:04] <jouncebot>	 jbond42 and cdanis: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1600).
[16:01:59] <wikibugs>	 (03CR) 10Volans: [C: 03+2] dns: add --keep-files option [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632745 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:02:31] <wikibugs>	 (03PS3) 10Volans: sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846)
[16:02:36] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address - https://phabricator.wikimedia.org/T264504 (10herron) With regard to why mail from eqiad seemed to be working while codfw was not -- part of this is because the working email examples are gerrit mails, which in addition to having...
[16:03:55] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:04:04] <wikibugs>	 (03PS2) 10Volans: sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846)
[16:04:14] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:05:05] <wikibugs>	 (03CR) 10Dzahn: "the puppet part looks good to me" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite)
[16:05:15] <wikibugs>	 (03Merged) 10jenkins-bot: sre.dns.netbox: add --skip-authdns-update option [cookbooks] - 10https://gerrit.wikimedia.org/r/632697 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:05:30] <wikibugs>	 (03Merged) 10jenkins-bot: sre.dns.netbox: add --emergency-manual-edit option [cookbooks] - 10https://gerrit.wikimedia.org/r/632746 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:05:39] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) Thanks all, quick update.  I have deployed the firewall change to idp-test1001 and the scan time is about 3x faster with the new rule (see below).  howe...
[16:08:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[16:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:20] <logmsgbot>	 !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[16:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:03] <wikibugs>	 (03CR) 10Dzahn: profile: apply ipsec monitoring where enabled with ipsec_exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: 10Cwhite)
[16:09:35] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[16:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:34] <wikibugs>	 (03PS1) 10Volans: added prefix 91.198.174.224/27, adapt INCLUDE [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846)
[16:13:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] added prefix 91.198.174.224/27, adapt INCLUDE [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:15:04] <wikibugs>	 (03CR) 10Volans: "Expected CI failure as the file doesn't exist yet, I'm deploying it with the cookbook with --skip-authdns-update and then recheck this one" [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:15:17] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:28] <wikibugs>	 (03CR) 10Volans: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:16:15] <wikibugs>	 (03CR) 10Volans: [C: 03+2] added prefix 91.198.174.224/27, adapt INCLUDE [dns] - 10https://gerrit.wikimedia.org/r/632950 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[16:16:16] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide, 10Wikimedia-GitHub: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10hashar) The repository on deploy1001 points at master and does fetch from Ger...
[16:16:34] <wikibugs>	 10Operations, 10Gerrit, 10Wikimedia Design Style Guide, 10Wikimedia-GitHub, and 2 others: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10hashar)
[16:18:04] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10Aklapper)
[16:18:35] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10KFrancis) 05Open→03Resolved a:05KFrancis→03lilients_WMDE Hi All, the NDA is complete.  Thanks!
[16:19:09] <hashar>	 !log Restarting CI Jenkins
[16:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:01] <wikibugs>	 (03PS3) 10Volans: dns: consolidate reverse zone files (part 1) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273)
[16:22:03] <wikibugs>	 (03PS1) 10Volans: dns: consolidate reverse zone files (part 2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273)
[16:22:21] <wikibugs>	 (03PS1) 10Volans: netbox: move $INCLUDEs to the consolidated files [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273)
[16:23:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] netbox: move $INCLUDEs to the consolidated files [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[16:23:50] <wikibugs>	 (03CR) 10Volans: "Expected CI failure because it depends on the merge and deploy of the depends-on change." [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[16:25:19] <mutante>	 !log rebooting cloudvirt1023 - trying PXE boot
[16:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:42] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10herron) Actually, after some further manual testing I think we have a reason:  521 5.7.1 Service unavailable; client [208.8...
[16:32:14] <wikibugs>	 (03CR) 10Volans: "This is the new diff https://phabricator.wikimedia.org/P12939" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[16:33:35] <wikibugs>	 (03CR) 10Volans: "Diff is https://phabricator.wikimedia.org/P12954" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[16:42:08] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10jbolorinos-ctr) I think my wikitech username is jbol (is this the login for gerrit?)
[16:42:17] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692)
[16:44:35] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692)
[16:48:48] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692)
[16:50:27] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692)
[16:50:29] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692)
[16:55:13] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "I think that this looks good to me. It should be a harmless minimal change as we discussed and the code looks fine." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[16:58:59] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "One way to find out!" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm)
[16:59:58] <wikibugs>	 (03Merged) 10jenkins-bot: locales: switch to using locales-all package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632582 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm)
[17:00:05] <jouncebot>	 chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1700).
[17:12:35] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[17:14:01] <wikibugs>	 (03CR) 10Dzahn: "thanks for merging :))" [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn)
[17:14:45] <wikibugs>	 (03CR) 10Dbarratt: [C: 03+1] Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders)
[17:16:27] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "Looks good" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[17:16:32] <shdubsh>	 !log install prometheus-rsyslog-exporter_0.0.0+git20201008 on centrallog1001 - T210137
[17:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:37] <stashbot>	 T210137: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137
[17:23:55] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[17:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:55] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:51] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:35] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10Urbanecm) Thanks @herron. I guess we should investigate why the reason doesn't appear in our local logs. Should I open a fo...
[17:31:54] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:30] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10observability, 10Release-Engineering-Team (Deployment services): "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (10colewhite) Indeed, there is a bit of delay due to retries and...
[17:32:55] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (remote system drops connection without providing a reason) - https://phabricator.wikimedia.org/T264504 (10Urbanecm)
[17:33:29] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10Urbanecm)
[17:34:05] <wikibugs>	 (03PS6) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660)
[17:34:32] <icinga-wm>	 RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[17:35:44] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[17:37:10] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@945e5c1]: airflow: Set search satisfaction dag start date to oldest current available data
[17:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:19] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.dns.netbox
[17:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:40] <volans>	 that was me, double sudo by mistake
[17:46:38] <wikibugs>	 (03PS1) 10Volans: sre.dns.netbox: improve user message [cookbooks] - 10https://gerrit.wikimedia.org/r/632971 (https://phabricator.wikimedia.org/T264846)
[17:46:57] <wikibugs>	 (03PS4) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692)
[17:46:59] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692)
[17:47:01] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt1023: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/632972 (https://phabricator.wikimedia.org/T259399)
[17:48:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1023: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/632972 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott)
[17:48:59] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.dns.netbox: improve user message [cookbooks] - 10https://gerrit.wikimedia.org/r/632971 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[17:49:05] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@945e5c1]: airflow: Set search satisfaction dag start date to oldest current available data (duration: 11m 55s)
[17:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:10] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:50:11] <wikibugs>	 (03Merged) 10jenkins-bot: sre.dns.netbox: improve user message [cookbooks] - 10https://gerrit.wikimedia.org/r/632971 (https://phabricator.wikimedia.org/T264846) (owner: 10Volans)
[17:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:15] <wikibugs>	 (03PS8) 10Dzahn: thumbor: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630694
[17:57:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "disabling puppet on thumbor* via cumin, applying on one host to confirm noop, then enable on others again.. i think this is what Effie mea" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn)
[17:58:12] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25787/thumbor1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn)
[17:59:10] <wikibugs>	 (03CR) 10Dzahn: "wmf-style: total violations delta -6  ( -7, +1)" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1800).
[18:00:04] <jouncebot>	 Tchanders: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:29] <Urbanecm>	 \o/
[18:00:34] <Urbanecm>	 Tchanders: wanna self-service? ;)
[18:01:33] <Tchanders>	 Urbanecm: I'm set up to deploy this - JamesF agreed to help this time too!
[18:01:36] <James_F>	 Urbanecm: Yeah, Tchanders and I will deal.
[18:01:39] <James_F>	 Snap.
[18:01:50] <Urbanecm>	 *scap
[18:01:54] <Urbanecm>	 :P
[18:02:38] <wikibugs>	 (03CR) 10Dzahn: "confirmed noop on thumbor1001, thumbor2001,... re-enabling puppet on thumbor*" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn)
[18:02:39] <Tchanders>	 Thanks Urbanecm :)
[18:03:23] <wikibugs>	 (03PS1) 10Razzi: Revert "oozie: use admin groups to determine admin access" [puppet] - 10https://gerrit.wikimedia.org/r/632823
[18:04:15] <wikibugs>	 (03CR) 10Tchanders: [C: 03+2] Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders)
[18:04:54] <hauskatze>	 \o/
[18:06:34] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Special:Investigate by default on production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632908 (https://phabricator.wikimedia.org/T264357) (owner: 10Tchanders)
[18:06:36] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] Revert "oozie: use admin groups to determine admin access" [puppet] - 10https://gerrit.wikimedia.org/r/632823 (owner: 10Razzi)
[18:17:00] <logmsgbot>	 !log tchanders@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632908|Enable Special:Investigate by default on production (T264357)]] (duration: 01m 06s)
[18:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:06] <stashbot>	 T264357: Deploy Special:Investigate to all wikis - https://phabricator.wikimedia.org/T264357
[18:17:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "turns out this is not currently used but should not be deleted quite yet" [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn)
[18:18:57] <wikibugs>	 (03PS2) 10Dzahn: elasticsearch::cirrus: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/632567
[18:28:53] <wikibugs>	 (03PS3) 10CRusnov: diffscan.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/630703 (https://phabricator.wikimedia.org/T247364)
[18:31:25] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10elukey)
[18:31:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop:  https://puppet-compiler.wmflabs.org/compiler1001/25790/" [puppet] - 10https://gerrit.wikimedia.org/r/632567 (owner: 10Dzahn)
[18:34:26] <wikibugs>	 (03CR) 10Dzahn: "confirmed complete noop on elastic1032, elastic2055, relforge1002, .." [puppet] - 10https://gerrit.wikimedia.org/r/632567 (owner: 10Dzahn)
[18:34:40] <wikibugs>	 (03PS5) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692)
[18:34:42] <wikibugs>	 (03PS4) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692)
[18:34:44] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs backy2: allow hiera config of when the backup runs [puppet] - 10https://gerrit.wikimedia.org/r/632976 (https://phabricator.wikimedia.org/T260692)
[18:39:03] <wikibugs>	 (03PS2) 10Dzahn: hadoop::monitoring: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631302
[18:39:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "already compiled on master, client and worker, will confirm anyways" [puppet] - 10https://gerrit.wikimedia.org/r/631302 (owner: 10Dzahn)
[18:47:38] <wikibugs>	 (03CR) 10Razzi: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi)
[18:49:00] <wikibugs>	 (03CR) 10Razzi: "This is part 1 of 3:" [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi)
[18:50:58] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 52 threshold =0.15 breach: active_primary_shards: 52, task_max_waiting_in_queue_millis: 608, unassigned_shards: 46, active_shards: 52, initializing_shards: 6, number_of_pending_tasks: 9, relocating_shards: 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0, number_of_nodes: 2, active_shards_percent_a
[18:50:58] <icinga-wm>	 tatus: red, cluster_name: relforge-eqiad, timed_out: False, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:52:40] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: unassigned_shards: 0, timed_out: False, number_of_pending_tasks: 0, number_of_nodes: 2, number_of_data_nodes: 2, status: green, active_primary_shards: 83, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, relocating_shards: 0, number_of_in_flight_fetch: 0, active_shards: 104, cluster_name: relforge-
[18:52:40] <icinga-wm>	 rds_percent_as_number: 100.0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:57:42] <logmsgbot>	 !log volker-e@deploy1001 Started deploy [design/style-guide@b1166af]: Deploy design/style-guide:
[18:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:49] <logmsgbot>	 !log volker-e@deploy1001 Finished deploy [design/style-guide@b1166af]: Deploy design/style-guide:  (duration: 00m 06s)
[18:57:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:57] <wikibugs>	 10Operations, 10Gerrit, 10Wikimedia Design Style Guide, 10Wikimedia-GitHub, and 2 others: Deployment of latest Design Style Guide Gerrit clone doesn't seem to succeed - https://phabricator.wikimedia.org/T264894 (10Volker_E) 05Open→03Invalid There was a Git misconfiguration locally. Has worked now.  Sor...
[19:00:04] <jouncebot>	 hashar and marxarelli: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T1900).
[19:03:36] <wikibugs>	 (03CR) 10Dzahn: "confirmed noop on an-worker1087, an-master1001, flerovium, analytics1047, alerts1001" [puppet] - 10https://gerrit.wikimedia.org/r/631302 (owner: 10Dzahn)
[19:03:41] <wikibugs>	 (03PS1) 10Bstorm: locales: the update-locale command is in the locales package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632982
[19:04:37] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] locales: the update-locale command is in the locales package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632982 (owner: 10Bstorm)
[19:05:06] <wikibugs>	 (03Merged) 10jenkins-bot: locales: the update-locale command is in the locales package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/632982 (owner: 10Bstorm)
[19:11:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs backy2: allow hiera config of when the backup runs [puppet] - 10https://gerrit.wikimedia.org/r/632976 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[19:21:04] <wikibugs>	 (03PS4) 10Dzahn: zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295
[19:23:44] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@3b11443]: search_satisfaction: Alias sample multiplier to expected name
[19:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:01] <wikibugs>	 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10Ladsgroup)
[19:24:53] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@3b11443]: search_satisfaction: Alias sample multiplier to expected name (duration: 01m 09s)
[19:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:27] <wikibugs>	 (03CR) 10BryanDavis: wmcs server backups: Add a way to assign projects to backup hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[19:31:28] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn)
[19:32:02] <wikibugs>	 (03CR) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[19:34:25] <wikibugs>	 (03PS3) 10Esanders: Drop wgHiddenPrefs hack for VE Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620958 (https://phabricator.wikimedia.org/T254349)
[19:35:49] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10Dzahn) Using the same key should be fine. But we will need a new "expiry_date" please.  And should we use expiry_contact: nruiz@ like before?
[19:36:02] <wikibugs>	 (03PS6) 10Andrew Bogott: wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692)
[19:36:04] <wikibugs>	 (03PS5) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692)
[19:38:49] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] zookeeper: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn)
[19:39:25] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "Looks good like this, I think." [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[19:40:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs server backups: Add a way to assign projects to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/632960 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[19:41:26] <wikibugs>	 (03CR) 10Dzahn: "noop also confirmed on the same hosts after merging" [puppet] - 10https://gerrit.wikimedia.org/r/631295 (owner: 10Dzahn)
[19:42:04] <wikibugs>	 (03PS3) 10Dzahn: toolforge/grid: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/631315
[19:49:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692) (owner: 10Andrew Bogott)
[19:49:42] <wikibugs>	 (03PS6) 10Andrew Bogott: wmcs backups: remove the 'special_projects' logic [puppet] - 10https://gerrit.wikimedia.org/r/632961 (https://phabricator.wikimedia.org/T260692)
[19:52:41] <wikibugs>	 (03CR) 10Dzahn: "WMCS team: are you ok if we go forward with this? This may look large but it's the same hiera/lookup replacement we did before and nothing" [puppet] - 10https://gerrit.wikimedia.org/r/631315 (owner: 10Dzahn)
[19:55:43] <wikibugs>	 (03PS2) 10Dzahn: dumps/homer: turn bash scripts into sh scripts [puppet] - 10https://gerrit.wikimedia.org/r/631892 (https://phabricator.wikimedia.org/T95064)
[19:58:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks for the reviews, reducing this patch to just change the dumps and homer scripts and merging that" [puppet] - 10https://gerrit.wikimedia.org/r/631892 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn)
[20:04:24] <wikibugs>	 (03CR) 10Dzahn: "Andrew/Bryan you think this monitoring can be removed at this point?" [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:07:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen)
[20:08:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen)
[20:09:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen)
[20:09:59] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25796/scb2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631898 (owner: 10Dzahn)
[20:11:56] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:14] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10herron) wiki-mail-codfw.wikimedia.org has been delisted.  This should resolve the issue outlined in the description here.  In terms of foll...
[20:13:29] <wikibugs>	 (03CR) 10Dzahn: "simply replacing hiera() with lookup() when there are no default values and other changes has never been any difference anywhere so far. n" [puppet] - 10https://gerrit.wikimedia.org/r/631898 (owner: 10Dzahn)
[20:14:33] <wikibugs>	 (03PS2) 10Dzahn: openstack: turn bash scripts without bashisms into sh scripts [puppet] - 10https://gerrit.wikimedia.org/r/631891 (https://phabricator.wikimedia.org/T95064)
[20:15:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen)
[20:16:11] <wikibugs>	 (03PS11) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666
[20:17:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Jgreen)
[20:20:15] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 7: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn)
[20:23:23] <chrisalbon>	 _eyes ORES eqiad suspiciously_
[20:23:28] <chrisalbon>	 don't you do it
[20:23:41] <wikibugs>	 (03CR) 10BryanDavis: toolforge/dynamicproxy: remove diamond monitoring proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:30:03] <wikibugs>	 (03CR) 10Dzahn: toolforge/dynamicproxy: remove diamond monitoring proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:31:18] <wikibugs>	 (03PS2) 10Dzahn: toolforge/dynamicproxy: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993)
[20:32:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge/dynamicproxy: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:33:09] <wikibugs>	 (03PS3) 10Dzahn: toolforge: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993)
[20:33:49] <wikibugs>	 (03CR) 10Dzahn: "renamed to reflect what it actually is now. just removes the profile that absented a diamond collector" [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:35:51] <wikibugs>	 (03PS1) 10Hashar: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632991 (https://phabricator.wikimedia.org/T264793)
[20:38:07] <wikibugs>	 (03CR) 10Dzahn: "Jaime, Alex, i assume we are keeping this for a bit longer just in case?" [puppet] - 10https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn)
[20:40:32] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] toolforge: delete profile::toolforge::services::basic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:42:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "per "already cherry-picked" 😊" [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond)
[20:43:03] <wikibugs>	 (03PS3) 10Dzahn: Delete puppet role and module for Phragile [puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) (owner: 10Aklapper)
[20:43:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/632475 (https://phabricator.wikimedia.org/T240308) (owner: 10Aklapper)
[20:43:34] <volans>	 !log deploying Netbox DNS zone consolidation - T264273
[20:43:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:40] <stashbot>	 T264273: DNS: per prefix zone-file limitation - https://phabricator.wikimedia.org/T264273
[20:43:49] <wikibugs>	 (03CR) 10Volans: [C: 03+2] dns: consolidate reverse zone files (part 1) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632574 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[20:44:43] <wikibugs>	 (03CR) 10Dzahn: "yep, whitespace issue fixed" [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:45:23] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[20:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:21] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "Deploying this!" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632991 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar)
[20:48:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] toolforge: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:48:46] <wikibugs>	 (03PS4) 10Dzahn: toolforge: delete profile::toolforge::services::basic [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993)
[20:50:49] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:50:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] toolforge: delete profile::toolforge::services::basic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632571 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn)
[20:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:56] <wikibugs>	 (03CR) 10Volans: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[20:55:05] <wikibugs>	 (03CR) 10Volans: [C: 03+2] netbox: move $INCLUDEs to the consolidated files [dns] - 10https://gerrit.wikimedia.org/r/632953 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[20:58:21] <wikibugs>	 (03PS2) 10Volans: dns: consolidate reverse zone files (part 2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273)
[20:59:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, already uses the same hdfs commands before, just drops a step" [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi)
[20:59:31] <wikibugs>	 (03CR) 10Volans: [C: 03+2] dns: consolidate reverse zone files (part 2) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/632952 (https://phabricator.wikimedia.org/T264273) (owner: 10Volans)
[21:00:23] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[21:00:25] <logmsgbot>	 !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[21:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:35] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[21:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:38] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:32] <wikibugs>	 (03Merged) 10jenkins-bot: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/632991 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar)
[21:11:11] <wikibugs>	 (03PS1) 10Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s [puppet] - 10https://gerrit.wikimedia.org/r/633017
[21:11:35] <hashar>	 deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/632991
[21:14:04] <wikibugs>	 (03PS1) 10Dzahn: parsoid: replace a hiera call with lookup [puppet] - 10https://gerrit.wikimedia.org/r/633020
[21:14:49] <wikibugs>	 (03PS1) 10Hashar: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793)
[21:15:02] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "deploying" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar)
[21:16:47] <wikibugs>	 (03PS1) 10Dzahn: elasticsearch: replace hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/633022
[21:22:20] <wikibugs>	 (03CR) 10Dzahn: "pretty much guaranteed noop, especially where there were no default values..which is all except one or so" [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn)
[21:22:38] <wikibugs>	 (03PS2) 10Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s [puppet] - 10https://gerrit.wikimedia.org/r/633017
[21:22:42] <wikibugs>	 (03PS1) 10Dzahn: cumin: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633024
[21:25:12] <wikibugs>	 (03CR) 10Dzahn: "also..removing the lint-ignore that apparently is not needed.. puppet-lint accepts it" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn)
[21:28:39] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32378655}   Suspiciously similar just happened to ores on eqiad. I reset the uwsgi service on all ores100x boxes and will monitor.
[21:29:29] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM if the compiler is happy :)" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn)
[21:31:21] <wikibugs>	 (03PS1) 10Dzahn: pybal: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633026
[21:34:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks! it is: cumin::master: https://puppet-compiler.wmflabs.org/compiler1002/25798/cumin2001.codfw.wmnet/index.html  cumin::target: http" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn)
[21:34:56] <wikibugs>	 (03PS2) 10Dzahn: cumin: replace hiera with lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/633024
[21:37:20] <wikibugs>	 (03Merged) 10jenkins-bot: Deduplicate SessionBackend::logPersistenceChange calls [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar)
[21:37:57] <wikibugs>	 10Operations, 10Mail, 10Security: Don't get a mail to confirm my email address (mx2001 is blacklisted by abusix blacklist) - https://phabricator.wikimedia.org/T264504 (10Dzahn) >  in addition to having different message contents are also sent outward via the main mx host interface instead of the wiki-mail-si...
[21:40:27] <wikibugs>	 (03CR) 10Dzahn: "noop on cumin1001/2001 and various targets" [puppet] - 10https://gerrit.wikimedia.org/r/633024 (owner: 10Dzahn)
[21:41:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be106[0-3] - https://phabricator.wikimedia.org/T265093 (10RobH)
[21:41:37] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be106[0-3] - https://phabricator.wikimedia.org/T265093 (10RobH)
[21:43:43] <hashar>	 deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/632992
[21:45:13] <wikibugs>	 (03PS1) 10Dzahn: local_dev::docker_publish: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/633027
[21:46:11] <librenms-wmf>	 08Warning Alert for device mr1-eqsin.wikimedia.org - Processor usage over 85%
[21:46:33] <wikibugs>	 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10LGoto)
[21:51:22] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10Nuria) Expiry contact will be @Ottomata; end date is April 1 2021
[21:52:21] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.36.0-wmf.10/includes/session/SessionBackend.php: Deduplicate SessionBackend::logPersistenceChange calls - T264793 (duration: 01m 01s)
[21:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:29] <stashbot>	 T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793
[21:52:33] <wikibugs>	 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10JMinor) We are publishing v6.7.2 (1780) of the app as I write, which has our...
[21:52:58] <wikibugs>	 (03CR) 10Hashar: "Deployed" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/632992 (https://phabricator.wikimedia.org/T264793) (owner: 10Hashar)
[21:53:16] <wikibugs>	 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10JMinor) a:05Dmantena→03None
[21:53:39] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a923949]: search_satisfaction: update druid datasource to match previous data
[21:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:25] <wikibugs>	 10Operations, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10CDanis) Really great to see this happen so quickly!  Thanks so much :)  I'll...
[21:54:44] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a923949]: search_satisfaction: update druid datasource to match previous data (duration: 01m 04s)
[21:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:31] <wikibugs>	 (03PS1) 10Dzahn: mediawiki: replace hiera with lookup, add data types in all profiles [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953)
[21:59:49] <hashar>	 that deduplication patch seems to have worked properly \o/
[22:03:38] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:38] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for 
[22:03:38] <icinga-wm>	 timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[22:03:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:34] <mutante>	 just like last time and right after deploy
[22:04:35] <mutante>	 right
[22:04:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:05:56] <rzl>	 mutante: the deploy is a red herring, the cause turned out to be T264881
[22:05:56] <stashbot>	 T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881
[22:06:12] <mutante>	 rzl: oh, ok. thank you
[22:06:15] <rzl>	 (which the iOS devs have done a really nice job jumping on and fixing, we're just waiting for the deployment)
[22:06:27] <mutante>	 *nod* cool
[22:06:37] <icinga-wm>	 PROBLEM - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.47 and port 4101: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:06:54] <cdanis>	 here
[22:07:00] <rzl>	 ^ I'll ack that, it's the same, no action needed
[22:07:11] <mutante>	 cdanis: we are just waiting for a fix to be deployed ^
[22:07:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:24] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[22:07:33] <cdanis>	 mutante: sorry, are you referring to the iOS app?
[22:07:37] <mutante>	 cdanis: yes
[22:07:39] <cdanis>	 yeah
[22:07:48] <cdanis>	 realized as soon as I looked what time it was :)
[22:08:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:08:27] <mutante>	 the "restbase-dev" is what it starts with and gave it away 
[22:08:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:32] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[22:08:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:40] * volans with one foot in the bed already, need any help?
[22:08:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:49] <rzl>	 volans: nope, sleep well
[22:08:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:52] <cdanis>	 volans: no, this is the usual 22:00 UTC thing
[22:09:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:09:07] <volans>	 "usual" :/
[22:09:26] <cdanis>	 volans: diagnosed and soon fixed, at least!
[22:09:34] <volans|off>	 ack thx
[22:09:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:00] <mutante>	 rescheduled checks to make it faster
[22:10:04] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:10:07] <icinga-wm>	 RECOVERY - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1006 bytes in 3.718 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:10:12] <cdanis>	 wow this one was a bit more percussive than usual -- check out appserver latency
[22:10:30] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles f
[22:10:30] <icinga-wm>	 6) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[22:10:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:32] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:10:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:32] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:10:33] <bblack>	 ?
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:10:41] <cdanis>	 bblack: T264881
[22:10:51] <bblack>	 ok
[22:11:12] <bblack>	 the live icinga spam here on IRC seemed "different", like, much faster
[22:11:20] <cdanis>	 yeah, this was more severe than usual
[22:11:32] <bblack>	 did we fix/tweak something about the irc echoing bits?
[22:11:34] <mutante>	 i sped up the recovery a bit 
[22:11:37] <cdanis>	 it's not clear to me why these requests don't get coalesced more
[22:11:37] <bblack>	 ok
[22:11:44] <cdanis>	 they are cacheable, at least the ones I've seen
[22:12:04] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[22:12:06] <cdanis>	 hm, restbase error rate is still fairly high
[22:12:11] <librenms-wmf>	 08̶W̶a̶r̶n̶i̶n̶g Device mr1-eqsin.wikimedia.org recovered from Processor usage over 85%
[22:12:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:14:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:15:25] <rzl>	 cdanis: looks okay now, just recovered slowly for some reason
[22:15:32] <cdanis>	 yeah
[22:15:48] <rzl>	 looks like it did the same yesterday too
[22:15:53] <rzl>	 (at a lower error rate)
[22:16:48] <rzl>	 actually, that was the *request* rate falling off
[22:16:49] <rzl>	 https://grafana.wikimedia.org/d/000000068/restbase?viewPanel=15&orgId=1&from=now-1h&to=now
[22:17:13] <cdanis>	 yeah, the page summary API does get hit by the mobile apps AIUI
[22:17:24] <rzl>	 nod
[22:17:43] <rzl>	 sharp peak for "v1_page_random_-format-" and then a longer falloff for "v1_page_summary_-title-"
[22:18:53] <rzl>	 I'm not sure what "random" is doing in there, is that the endpoint for the featured page somehow? it's overwhelmingly where the 5xxs are
[22:19:00] <cdanis>	 rzl: https://w.wiki/fsZ
[22:19:02] <cdanis>	 no
[22:19:11] <cdanis>	 it is literally the 'random wiki page' handler
[22:19:27] <cdanis>	 also being removed https://phabricator.wikimedia.org/T264881#6525670
[22:19:49] <rzl>	 yeah that's why I'm surprised it spikes at midn-- ohhh okay
[22:19:52] <cdanis>	 OH
[22:19:54] <cdanis>	 OH
[22:19:57] <cdanis>	 *OH*
[22:20:03] <rzl>	 💡?
[22:20:06] <cdanis>	 do you know what is special about that endpoint?!
[22:20:11] <rzl>	 it's not cacheable, is it
[22:20:12] <cdanis>	 IT ISN'T CACHEABLE
[22:20:42] <rzl>	 do you know what else is special about it, though
[22:21:07] <rzl>	 when we get a spike of requests over the ordinary rate, nobody will see the responses or care if they're 429s
[22:21:19] <rzl>	 until after this iOS deploy goes out anyway
[22:21:50] <wikibugs>	 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10eprodromou)
[22:21:55] <rzl>	 I guess that's not true, even if we restrict by user agent we'll probably still rate limit *some* actual humans who just happen to be bored at 22:02 UTC
[22:22:13] <cdanis>	 I think already almost no one cares during these spikes -- Tsevener said that the widgets (which cause the synchronized traffic) don't need the random page, they just happen to call code that also fetches it
[22:22:26] <rzl>	 right yeah exactly
[22:22:51] <rzl>	 what I mean is we can rate-limit this at varnish without breaking even the feature that relies on the background fetch
[22:23:02] <cdanis>	 yeah, exactly
[22:23:08] <rzl>	 we wouldn't want to rate limit the "get featured page" but we don't have to, that's cacheable
[22:23:10] <cdanis>	 we don't usually ratelimit not-per-IP but I think it's reasonable
[22:24:10] <rzl>	 ohh, I see where we got off the same page
[22:24:19] <rzl>	 by "until after this iOS deploy" I just meant, we won't need the rate limit after that
[22:24:28] <cdanis>	 right, I realized after
[22:24:36] <rzl>	 temporary fix until the permanent one is out in the field
[22:24:37] <rzl>	 👍
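(Editor's note) The temporary mitigation discussed above, a global (not per-IP) rate limit on the uncacheable "random page" endpoint that answers excess requests with 429, could be sketched roughly as below. This is only an illustration in Python, not the Varnish configuration that was or would be deployed: the path prefix, the rate and burst numbers, and the names TokenBucket and status_for are all assumptions made up for the example.

```python
import time


class TokenBucket:
    """Global token bucket: refills at `rate` tokens/second, holds at most `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = burst         # start full
        self.last = clock()

    def allow(self):
        """Consume one token if available; return whether the request may pass."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical prefix for the uncacheable endpoint; the real routing differs.
UNCACHEABLE_PREFIXES = ("/v1/page/random/",)


def status_for(path, bucket):
    """Return 429 for over-limit requests to the uncacheable endpoint, else 200.

    Cacheable endpoints (e.g. the featured page) are never limited here,
    since a cache can absorb the synchronized burst for them.
    """
    if path.startswith(UNCACHEABLE_PREFIXES) and not bucket.allow():
        return 429
    return 200
```

A single shared bucket is used here on purpose, matching the "not-per-IP" remark: the burst comes from many distinct clients at once, so a per-client limit would not cap the total load. As noted in the discussion, this would also turn away some real users during the spike.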
[22:32:30] <wikibugs>	 (03CR) 10Jeena Huneidi: "This looks fine to me...but I don't know much about puppet." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633017 (owner: 10Dzahn)
[22:35:05] <wikibugs>	 (03PS4) 10Razzi: geoip: archive MaxMind database to hdfs only [puppet] - 10https://gerrit.wikimedia.org/r/631896 (https://phabricator.wikimedia.org/T264152)
[22:35:07] <wikibugs>	 (03PS1) 10Razzi: geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152)
[22:35:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:36:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) (owner: Razzi)
[22:36:31] <wikibugs>	 (CR) Dzahn: ci: replace hiera with lookup, jenkins, shipyard, pipeline, k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/633017 (owner: Dzahn)
[22:36:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:36:59] <wikibugs>	 (PS1) Dzahn: calico: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/633033
[22:37:15] <wikibugs>	 Operations, Traffic, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) When the 22:00 traffic spike happened today, it was a bit more impac...
[22:54:56] <ryankemper>	 !log About to start plugin upgrade followed by restarts of `cloudelastic`. Maintenance window set for the next 2 hours on `cloudelastic100[1-6]`
[22:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:08] <ryankemper>	 !log `sudo -E cumin -b 6 C:role::elasticsearch::cloudelastic 'DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install wmf-elasticsearch-search-plugins'`
[22:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:34] <ryankemper>	 !log `sudo apt policy wmf-elasticsearch-search-plugins` shows correct state: `Installed: 6.5.4-4~stretch`
[22:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201008T2300)
[23:00:04] <jouncebot>	 tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:42] <ryankemper>	 !log Writes are frozen for `cloudelastic`: `/usr/local/bin/mwscript extensions/CirrusSearch/maintenance/FreezeWritesToCluster.php --wiki=enwiki --cluster=cloudelastic` on `mwmaint2001` => `Applied cluster-wide freeze`
[23:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:49] <wikibugs>	 (CR) Gergő Tisza: [C: +2] Enable logging of session cookie changes everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[23:04:27] <wikibugs>	 (PS2) Gergő Tisza: Enable logging of session cookie changes everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793)
[23:04:56] <ryankemper>	 !log Beginning cluster restarts one server at a time. For each server, the process is depool->restart elasticsearch services->wait for services to restart and then pool->wait for cluster to return to green status before starting next server
[23:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:34] <wikibugs>	 (CR) Gergő Tisza: [C: +2] Enable logging of session cookie changes everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[23:09:23] <wikibugs>	 (Merged) jenkins-bot: Enable logging of session cookie changes everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/632797 (https://phabricator.wikimedia.org/T264793) (owner: Gergő Tisza)
[23:16:21] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:632797|Enable logging of session cookie changes everywhere (T264793)]] (duration: 01m 01s)
[23:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:27] <stashbot>	 T264793: Make sure SessionManager emitting Set-Cookie headers gets logged - https://phabricator.wikimedia.org/T264793
[23:16:48] <ryankemper>	 !log `cloudelastic1001` is done restarting and cluster is green again. Proceeding to `cloudelastic1002`
[23:16:51] <tgr_>	 !log Evening deploys done
[23:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:56] <wikibugs>	 (CR) Bstorm: [C: +2] "I'm feeling feisty and think I'll merge this." [puppet] - https://gerrit.wikimedia.org/r/630589 (owner: Jbond)
[23:18:57] <wikibugs>	 (CR) Bstorm: [C: +1] "Oh wait, never mind. This has parent patches. Moving back to +1 until that's sorted 😜" [puppet] - https://gerrit.wikimedia.org/r/630589 (owner: Jbond)
[23:23:52] <ryankemper>	 !log `cloudelastic1002` done
[23:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:40] <ryankemper>	 !log `cloudelastic1003` done
[23:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:01] <ryankemper>	 !log `cloudelastic1004` done
[23:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:03] <wikibugs>	 (PS1) Legoktm: [WIP] Add buildpack base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/633036
[23:33:53] <wikibugs>	 (PS2) Legoktm: [WIP] Add buildpack base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/633036
[23:37:24] <ryankemper>	 !log `cloudelastic1005` done
[23:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:34] <ryankemper>	 !log `cloudelastic1006` done. Writes thawed, maintenance window lifted; restarts are done for `cloudelastic`
[23:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:18] <wikibugs>	 (CR) Bstorm: "I'm getting an error related to drbd resources https://puppet-compiler.wmflabs.org/compiler1001/25800/" [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[23:45:29] <wikibugs>	 (CR) Bstorm: labstore: add data types and some other style fixes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[23:52:25] <wikibugs>	 (PS1) Dzahn: site: add mediawiki appserver role to a test VM [puppet] - https://gerrit.wikimedia.org/r/633038
[23:53:53] <wikibugs>	 (CR) Dzahn: [C: +2] site: add mediawiki appserver role to a test VM [puppet] - https://gerrit.wikimedia.org/r/633038 (owner: Dzahn)
[23:57:33] <wikibugs>	 (PS1) Dzahn: Revert "site: add mediawiki appserver role to a test VM" [puppet] - https://gerrit.wikimedia.org/r/632997
[23:58:15] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "site: add mediawiki appserver role to a test VM" [puppet] - https://gerrit.wikimedia.org/r/632997 (owner: Dzahn)