[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:00:48] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1302.eqiad.wmnet with reason: REIMAGE
[00:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:44] <wikibugs>	 (CR) Dzahn: [C: -1] wikilabels: replace cron with systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[00:02:49] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1301.eqiad.wmnet with reason: REIMAGE
[00:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:00] <wikibugs>	 (CR) Dzahn: [C: +1] "looks good to me, afaict. will let Bryan review though and please test a restart after merging this" [puppet] - https://gerrit.wikimedia.org/r/662764 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:09:08] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1387.eqiad.wmnet'] `  an...
[00:09:55] <wikibugs>	 (CR) Dzahn: [C: -1] "deploy2001.codfw.wmnet: Evaluation Error: Resource type not found: Stdlib::Path (Stdlib::Unixpath)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[00:10:10] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: 6.514e+06 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[00:10:50] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1008 is CRITICAL: 1.026e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008
[00:11:47] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1386.eqiad.wmnet'] `  an...
[00:14:48] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 1.555e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009
[00:15:30] <wikibugs>	 (PS4) Dzahn: kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[00:15:32] <wikibugs>	 (CR) Dzahn: [C: -1] kubernetes::deployment_server: add yaml to configure MediaWiki sites (1 comment) [puppet] - https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[00:19:08] <wikibugs>	 (CR) Dzahn: [C: -1] "PS4 fixed an issue but there is also "Failed to parse inline template: undefined method `content'"" [puppet] - https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[00:22:04] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1387.eqiad.wmnet
[00:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:20] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1386.eqiad.wmnet
[00:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:32] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1386.eqiad.wmnet
[00:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:16] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Joe) Another suggestion coming from twitter is https://play.google.com/store/apps/details?id=com.app.rcn, which anyways doesn't seem popular eno...
[00:28:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1387.eqiad.wmnet
[00:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:38] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Ladsgroup) I don't have much knowledge about India's internet infrastructure but from experience of Iran and blocking apps/websites. They show y...
[00:34:26] <wikibugs>	 (PS1) Bstorm: wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill [puppet] - https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044)
[00:35:15] <wikibugs>	 (CR) BryanDavis: [C: +1] "The diffs look reasonable to me. I can't remember off the top of my head if there have been api changes in irc.bot that would require othe" [puppet] - https://gerrit.wikimedia.org/r/662764 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:36:36] <wikibugs>	 (CR) Bstorm: "I have zero idea if the logging is even working right now. Currently, the daemons on each server both point at the same log file. I figure" [puppet] - https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: Bstorm)
[00:41:56] <razzi>	 ^ re kafka max lag: a slow migration extended past the downtime window but is going smoothly
[00:45:13] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1302.eqiad.wmnet'] `  an...
[00:46:08] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1301.eqiad.wmnet'] `  an...
[00:48:22] <wikibugs>	 (PS1) Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023)
[00:48:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1301.eqiad.wmnet
[00:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:44] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1302.eqiad.wmnet
[00:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] mwdebug: allow rsyncing home dirs from any mwdebug* to mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) (owner: Dzahn)
[00:59:26] <wikibugs>	 (PS2) Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023)
[00:59:42] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Preinheimer) Going to the TikTok website from India results in the regular TikTok page loading, with a banner from TikTok saying that the servic...
[01:01:42] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1008 is OK: (C)5e+06 ge (W)1e+06 ge 9.3e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008
[01:01:48] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1001 is OK: (C)5e+06 ge (W)1e+06 ge 8.736e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[01:02:00] <wikibugs>	 (PS3) Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023)
[01:06:49] <wikibugs>	 (PS1) Andrew Bogott: Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165)
[01:07:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: Andrew Bogott)
[01:09:00] <wikibugs>	 (PS2) Andrew Bogott: Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165)
[01:10:56] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1009 is OK: (C)5e+06 ge (W)1e+06 ge 8.019e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009
[01:12:51] <wikibugs>	 (PS3) Andrew Bogott: Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165)
[01:13:43] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Dzahn) https://newshimalaya.com/2021/02/09/%E2%9A%93-t273741-investigate-unusual-media-traffic-pattern-for-asternovi-belgii-flower-1mb-jpg-on-co...
[01:13:50] <wikibugs>	 (PS1) Tim Starling: Caching fixes [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/662677 (https://phabricator.wikimedia.org/T264391)
[01:14:41] <wikibugs>	 (CR) Tim Starling: [C: +2] Caching fixes [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/662677 (https://phabricator.wikimedia.org/T264391) (owner: Tim Starling)
[01:15:41] <ori>	 dpifke: ping on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/597654
[01:20:40] <wikibugs>	 (Merged) jenkins-bot: Caching fixes [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - https://gerrit.wikimedia.org/r/662677 (https://phabricator.wikimedia.org/T264391) (owner: Tim Starling)
[01:35:16] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1007 is OK: (C)5e+06 ge (W)1e+06 ge 7.903e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007
[01:36:00] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (varenc) >>! In T273741#6813823, @Preinheimer wrote: > Going to the TikTok website from India results in the regular TikTok page loading, with a...
[01:46:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1301.eqiad.wmnet
[01:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:19] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1302.eqiad.wmnet
[01:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:55] <logmsgbot>	 !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/FeaturedFeeds: probable fix for UBN T273242 (duration: 01m 06s)
[01:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:59] <stashbot>	 T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242
[02:07:30] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.30 [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662803
[02:35:56] <wikibugs>	 (PS1) Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/662805 (https://phabricator.wikimedia.org/T274208)
[02:37:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/662805 (https://phabricator.wikimedia.org/T274208) (owner: Andrew Bogott)
[02:57:02] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Properly override nginx timeouts [puppet] - https://gerrit.wikimedia.org/r/662806
[02:57:04] <wikibugs>	 (PS1) Legoktm: [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521)
[02:57:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[03:17:10] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (tomglynch) Hi all, I've been doing a bit of research into possible apps that could be causing this and found two potential culprits that I am cu...
[03:29:47] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (ssingh) Thank you everyone for the comments and suggestions. I just wanted to share that we have identified the app and will update this task to...
[04:33:17] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (mmodell) >>! In T273741#6813839, @Dzahn wrote: > ^ wut?  I tried to search for links to this image and found... this Phabricator ticket content...
[04:34:07] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (mmodell) Also hello hacker news!
[04:56:16] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (TheOv3rminD) >>! In T273741#6813995, @mmodell wrote: > Also, hello hacker news! https://news.ycombinator.com/item?id=26072025  Hello From us Hac...
[05:02:09] <logmsgbot>	 !log krinkle@deploy1001 Started deploy [integration/docroot@fdfb265]: I271e6054880, T273247
[05:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:14] <stashbot>	 T273247: Publish wikimedia/minify as its own repo and package - https://phabricator.wikimedia.org/T273247
[05:02:16] <logmsgbot>	 !log krinkle@deploy1001 Finished deploy [integration/docroot@fdfb265]: I271e6054880, T273247 (duration: 00m 06s)
[05:02:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:23] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Krinkle) I suspect this might have broken deplo...
[06:03:13] <wikibugs>	 (PS1) Marostegui: Revert "db1111: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/662680
[06:04:46] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1111: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/662680 (owner: Marostegui)
[06:05:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14233 and previous config saved to /var/cache/conftool/dbconfig/20210209-060520-root.json
[06:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:36] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (Marostegui) I have started to repool this host back.
[06:18:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 T258361', diff saved to https://phabricator.wikimedia.org/P14234 and previous config saved to /var/cache/conftool/dbconfig/20210209-061822-marostegui.json
[06:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:28] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:20:24] <marostegui>	 !log Stop mysql on s2 and s7 on db1090 to clone db1170 T258361
[06:20:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14236 and previous config saved to /var/cache/conftool/dbconfig/20210209-062024-root.json
[06:20:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:29] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1170 [puppet] - https://gerrit.wikimedia.org/r/662818 (https://phabricator.wikimedia.org/T258361)
[06:28:49] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1170 [puppet] - https://gerrit.wikimedia.org/r/662818 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:34:31] <wikibugs>	 (CR) Marostegui: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: Bstorm)
[06:35:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14237 and previous config saved to /var/cache/conftool/dbconfig/20210209-063527-root.json
[06:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:20] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.63`. Pre-deploy tests passing on canary `wdqs1003`
[06:38:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:21] <wikibugs>	 (PS1) Ayounsi: Depool esams for network maintenance [dns] - https://gerrit.wikimedia.org/r/662822 (https://phabricator.wikimedia.org/T272342)
[06:40:27] <ryankemper>	 !log Pooled `wdqs1007` and depooled `wdqs1005` (`1005` is ~12 hours behind)
[06:40:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:39] <logmsgbot>	 !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@582b070]: 0.3.63
[06:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:28] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.63` on canary `wdqs1003`; proceeding to rest of fleet
[06:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:27] <wikibugs>	 (PS2) Ayounsi: Depool esams for network maintenance [dns] - https://gerrit.wikimedia.org/r/662822 (https://phabricator.wikimedia.org/T272342)
[06:43:06] <wikibugs>	 (CR) Ayounsi: [C: +2] Depool esams for network maintenance [dns] - https://gerrit.wikimedia.org/r/662822 (https://phabricator.wikimedia.org/T272342) (owner: Ayounsi)
[06:44:12] <wikibugs>	 (CR) Marostegui: "Just tested clouddb1014 getting a query there doesn't seem to appear on the log:" [puppet] - https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: Bstorm)
[06:44:15] <XioNoX>	 !log depool esams for network maintenance - T272342
[06:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:25] <logmsgbot>	 !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@582b070]: 0.3.63 (duration: 06m 46s)
[06:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:27] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[06:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:31] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[06:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:37] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[06:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14238 and previous config saved to /var/cache/conftool/dbconfig/20210209-065031-root.json
[06:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:10] <wikibugs>	 (PS7) Elukey: Set Apache Bigtop 1.5 as default hadoop distro [puppet] - https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711)
[06:53:34] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:00:06] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:04:20] <XioNoX>	 !log depool disable 2 uplinks on asw2-esams - T272342
[07:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14239 and previous config saved to /var/cache/conftool/dbconfig/20210209-070534-root.json
[07:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:45] <wikibugs>	 (CR) Marostegui: "Some more testing:" [puppet] - https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: Bstorm)
[07:09:44] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:34] <XioNoX>	 planned maintenance ^
[07:15:14] <wikibugs>	 (CR) Elukey: [C: +2] Set Apache Bigtop 1.5 as default hadoop distro [puppet] - https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) (owner: Elukey)
[07:15:22] <icinga-wm>	 RECOVERY - Check systemd state on clouddb1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:30] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 12.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:20:02] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14240 and previous config saved to /var/cache/conftool/dbconfig/20210209-072038-root.json
[07:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:49] <ryankemper>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good
[07:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:11] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Zardula) Hello from hacker news
[07:30:42] <wikibugs>	 SRE, Commons, Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Majavah)
[07:33:38] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1081 from dbctl [puppet] - https://gerrit.wikimedia.org/r/662828 (https://phabricator.wikimedia.org/T273040)
[07:34:11] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1081 from dbctl [puppet] - https://gerrit.wikimedia.org/r/662828 (https://phabricator.wikimedia.org/T273040) (owner: Marostegui)
[07:34:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1081 from dbctl T273040', diff saved to https://phabricator.wikimedia.org/P14241 and previous config saved to /var/cache/conftool/dbconfig/20210209-073455-marostegui.json
[07:34:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:00] <stashbot>	 T273040: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040
[07:35:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 87, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:49] <wikibugs>	 SRE, Wikimedia-Portals, Patch-For-Review, Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (SarthakKundra) a:SarthakKundra→None
[07:39:03] <wikibugs>	 SRE, Wikimedia-Portals, Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (SarthakKundra)
[07:41:48] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 83, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:12] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:43:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:52] <wikibugs>	 SRE, netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (ayounsi) Disabling the interface on the asw2 side (via homer) `lang=diff [edit interfaces interface-range disabled] +    member et-4/0/51;      member ge-4/0/27 { ... } [edit interfaces interface-range disabled]...
[07:44:16] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:45:01] <wikibugs>	 SRE, netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (ayounsi) Open→Resolved Remote hands did the recabling.  Then pushed the cleanup via Homer as well. Everything is done.
[07:45:46] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:46:49] <XioNoX>	 routers recabling done, now moving on to upgrading asw2-esams OS
[07:47:01] <logmsgbot>	 !log hashar@deploy1001 Started deploy [integration/docroot@672e79f]: build: Add /scap/log to gitignore
[07:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:07] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [integration/docroot@672e79f]: build: Add /scap/log to gitignore (duration: 00m 06s)
[07:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:22] <XioNoX>	 !log redirect ns2 to authdns1001 - T252631
[07:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:26] <stashbot>	 T252631: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631
[07:55:51] <wikibugs>	 (PS5) Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - https://gerrit.wikimedia.org/r/651757
[08:02:55] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on 32 hosts with reason: switch upgrade
[08:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:07] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 32 hosts with reason: switch upgrade
[08:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:12] <wikibugs>	 10SRE, 10netops: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 (10ops-monitoring-bot) Icinga downtime set by ayounsi@cumin1001 for 1:30:00 32 host(s) and their services with reason: switch upgrade ` bast[3004-3005].wikimedia.org,cp[3050-3065].esams.wmnet,dns[3001-3002].wikimedia.org,g...
[08:09:50] <XioNoX>	 !log alright, brace yourself, esams switch stack is going to go down 
[08:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro)
[08:15:27] <icinga-wm>	 PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 63, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:42] <XioNoX>	 that's fine ^
[08:15:47] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:15:55] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:16:05] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:10] <XioNoX>	 that too ^
[08:16:43] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:47] <icinga-wm>	 PROBLEM - VRRP status on cr3-esams is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[08:17:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:19] <icinga-wm>	 RECOVERY - VRRP status on cr3-esams is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[08:22:17] <icinga-wm>	 RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:22:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:22:20] <wikibugs>	 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Zbyszko) @jbond About Kryo -  from what I see in Kryo src:   >  /** Registers the class using the lowest, next available integer ID and the {@link Kryo#getD...
[08:22:45] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:22:57] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:23:07] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 426, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:23:08] <wikibugs>	 (03Merged) 10jenkins-bot: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro)
[08:23:45] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:28:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:29:55] <hashar>	 phabricator can't reach its database, guess that is related to the above
[08:30:10] <XioNoX>	 hashar: which above?
[08:30:11] <XioNoX>	 !log rollback redirect ns2 to authdns1001 - T252631
[08:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:51] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:30:53] <hashar>	 no idea
[08:31:02] <elukey>	 hashar: can you give us more info?
[08:31:04] <hashar>	 phabricator can't reach its db: m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address.
[08:31:04] <_joe_>	 XioNoX: is that esams?
[08:31:17] <XioNoX>	 _joe_: yeah, all my work is esams
[08:31:18] <_joe_>	 same error here
[08:31:32] <hashar>	 so maybe that is not related :]
[08:32:17] <elukey>	 Attempt to connect to phabricatorphd@m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address.
[08:32:23] <elukey>	 mmmm
[08:32:40] <elukey>	 marostegui: hola hola, are you around?
[08:32:50] <marostegui>	 Hello, can we either discuss here or on -sre?
[08:32:51] <elukey>	 (just to be sure)
[08:33:08] <marostegui>	 Right now we are discussing this on -databases, -sre and here
[08:33:09] <elukey>	 ahh sorry my bad
[08:33:11] <marostegui>	 So let's try to focus
[08:33:18] <marostegui>	 database and proxy look up
[08:33:23] <marostegui>	 I can reach the master from the proxy
[08:33:23] <elukey>	 ahh sorry my bad
[08:33:39] <_joe_>	 elukey: go look at the phab machine, I suspect the problem is there
[08:33:52] <marostegui>	 I just reloaded the proxies and they keep seeing the DB as up
[08:34:01] <marostegui>	 phab is now back for me
[08:34:02] <hashar>	 phabricator should be on phab1001.eqiad.wmnet and "Cannot assign requested address." looks strange
[08:34:05] <_joe_>	 marostegui: maybe too many connections?
[08:34:07] <elukey>	 yep I am on it, probably something weird
[08:34:41] <_joe_>	 sorry I really gtg in 5 minutes :/
[08:34:42] <marostegui>	 _joe_: huge spike on connections right now
[08:34:54] <hashar>	 yeah https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=phab1001&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now  :-\
[08:35:27] <hashar>	 socket usage jumped from 10k to 30k
[08:36:02] <hashar>	 XioNoX: the Phabricator/db issue does not seem related to the network stuff. Sorry :]
[08:36:57] <hashar>	 (phab issue being followed up in private #mediawiki_security)
[08:38:52] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Depool esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/662681
[08:43:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Revert "Depool esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/662681 (owner: 10Ayounsi)
[08:44:30] <XioNoX>	 !log repool esams - T272342
[08:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:52] <wikibugs>	 (03PS11) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)
[08:46:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[08:46:06] <wikibugs>	 (03CR) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[08:46:47] <wikibugs>	 (03PS12) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)
[08:47:53] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <WMF LDAP GROUP> for <VERONICA THAMAINI> - https://phabricator.wikimedia.org/T274106 (10Aklapper) Thank you. Please see https://phabricator.wikimedia.org/project/profile/1564/ for info that is usually requested.
[08:52:07] <wikibugs>	 10SRE, 10netops: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 (10ayounsi) 05Open→03Resolved a:03ayounsi All done here.   Note that we're hitting something similar to https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1363186 But it says "These log messages are ha...
[08:55:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <WMF LDAP GROUP> for <VERONICA THAMAINI> - https://phabricator.wikimedia.org/T274106 (10VeronicaThamaini) Thank you @Aklapper.  Is this the additional information needed.   Username: Veronica Thamaini Shell access: I have a shell name. Should I share the name here...
[08:57:11] <wikibugs>	 (03CR) 10Gergő Tisza: [WIP] linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[08:59:53] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 36.89 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:00:08] <wikibugs>	 (03CR) 10Volans: [C: 03+2] documentation: add a development page [software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans)
[09:00:49] <wikibugs>	 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Shizhao) Rename to a new filename?
[09:01:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [WIP] linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[09:09:11] <wikibugs>	 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10MoritzMuehlenhoff)
[09:09:31] <wikibugs>	 (03PS1) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901
[09:09:39] <wikibugs>	 (03Merged) 10jenkins-bot: documentation: add a development page [software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans)
[09:11:25] <wikibugs>	 (03PS2) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901
[09:11:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 (owner: 10Jbond)
[09:12:19] <wikibugs>	 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10hashar)
[09:12:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 (owner: 10Jbond)
[09:13:07] <wikibugs>	 (03PS3) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901
[09:14:40] <wikibugs>	 (03PS1) 10Elukey: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903
[09:15:57] <wikibugs>	 (03PS2) 10Jbond: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:16:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM (minor fix to whitespace)" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:16:52] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27922/console" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:18:18] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "LGTM, 1 nit, maybe also point to cacheproxy::performance for additional comments/explanations." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:19:02] <wikibugs>	 (03PS3) 10Elukey: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903
[09:19:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'll open a task to create a common performance profile to that we can align all the various hacks sprinkled over our Puppet t" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:20:17] <wikibugs>	 (03PS4) 10Elukey: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903
[09:20:30] <wikibugs>	 10SRE, 10Traffic: Create a generic network proformance profile - https://phabricator.wikimedia.org/T274230 (10jbond)
[09:20:38] <wikibugs>	 (03CR) 10Elukey: "Fixed the comments to reflect the Phab specific use case :)" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:22:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey)
[09:22:12] <wikibugs>	 (03PS1) 10Volans: git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750)
[09:22:34] <godog>	 !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
[09:22:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:39] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[09:23:46] <wikibugs>	 10SRE, 10Traffic: Create a generic network proformance profile - https://phabricator.wikimedia.org/T274230 (10ayounsi)
[09:23:57] <wikibugs>	 10SRE, 10Traffic: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10ayounsi)
[09:24:02] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans)
[09:25:02] <wikibugs>	 (03PS1) 10David Caro: tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905
[09:26:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add defaults for profile::swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi)
[09:26:41] <wikibugs>	 (03CR) 10Kormat: "Big 👍 for the direction. I had a quick look, and don't see any obvious issues with the approach." [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[09:27:44] <wikibugs>	 (03PS2) 10David Caro: tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905
[09:27:46] <wikibugs>	 (03PS11) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412)
[09:30:03] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 72.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:30:04] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904)
[09:30:19] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:30:35] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10SRE-swift-storage, 10Patch-For-Review: Beta cluster Swift backend instances are missing profile::swift::storage::rsync_limit_memory_percent (puppet fails) - https://phabricator.wikimedia.org/T274092 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Issue should b...
[09:31:15] <wikibugs>	 10SRE, 10Traffic: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10MoritzMuehlenhoff) Other classes in our Puppet tree which already apply some of the generic settings:  * swift * profile::mediawiki::api * base::mysterious_sysctl * profile::mediawiki::common * profile...
[09:31:27] <wikibugs>	 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10MoritzMuehlenhoff)
[09:31:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10jcrespo) a:05jcrespo→03Jclark-ctr @robh it got resolved. It will go in the regular production internal vlan (same as ms-fe hosts). In the future there is a chance it will...
[09:32:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans)
[09:32:54] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10jcrespo) a:05jcrespo→03Papaul Internal production vlan, same as ms-fe hosts.
[09:33:25] <wikibugs>	 (03CR) 10David Caro: "Just a question, not a blocker in any case." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[09:33:43] <wikibugs>	 (03CR) 10Jcrespo: "Thanks, that was exactly what I needed, a quick thumbs up." [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[09:34:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 10%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14242 and previous config saved to /var/cache/conftool/dbconfig/20210209-093400-root.json
[09:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:16] <wikibugs>	 (03CR) 10Jcrespo: "Removing you to spare you from testing spam." [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo)
[09:34:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 10%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14243 and previous config saved to /var/cache/conftool/dbconfig/20210209-093429-root.json
[09:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:43] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1170:3312 and db1170:3317 is now replicating
[09:39:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908
[09:40:09] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908
[09:40:09] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff)
[09:40:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff)
[09:42:48] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908
[09:43:20] <wikibugs>	 (03CR) 10Volans: "question inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro)
[09:43:56] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) I will open a new task for this issue and add you there. While this is not a blocker for backup generation, it would be for an emergency, and we should...
[09:49:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 25%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14244 and previous config saved to /var/cache/conftool/dbconfig/20210209-094904-root.json
[09:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 25%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14245 and previous config saved to /var/cache/conftool/dbconfig/20210209-094932-root.json
[09:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:04] <wikibugs>	 (03CR) 10David Caro: tox: Fix runs when system setuptools is old (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro)
[09:51:30] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro)
[09:59:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:59:52] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper)
[10:00:10] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/662659 (owner: 10Hnowlan)
[10:04:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 50%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14246 and previous config saved to /var/cache/conftool/dbconfig/20210209-100407-root.json
[10:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 50%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14247 and previous config saved to /var/cache/conftool/dbconfig/20210209-100436-root.json
[10:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:41] <wikibugs>	 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar)
[10:10:32] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro)
[10:12:24] <logmsgbot>	 !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot
[10:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:36] <wikibugs>	 10SRE, 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10fgiunchedi) 05Resolved→03Open `logstash1020.mgmt` is shown as down in icinga, reopening  ` logstash1020.mgmt  View Service Details For This Host DOWN 2021-02-09 10:08:30 0d 18h 33m 24s...
[10:13:03] <gehel>	 ryankemper: ^^^ restarting of the wdqs public and internal clusters in progress
[10:13:04] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1019.eqiad.wmnet
[10:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:14] <wikibugs>	 (03PS1) 10Marostegui: instances: Add db1157 [puppet] - 10https://gerrit.wikimedia.org/r/662915 (https://phabricator.wikimedia.org/T258361)
[10:14:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances: Add db1157 [puppet] - 10https://gerrit.wikimedia.org/r/662915 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[10:14:18] <wikibugs>	 (03PS3) 10Jcrespo: mariadb-backups: Remove old scheduled job disabling [puppet] - 10https://gerrit.wikimedia.org/r/644861
[10:14:20] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.stop-cluster: avoid context managers to apply downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/662916
[10:14:58] <wikibugs>	 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar) CI still had some usage of `docker-registry.wikimedia.org/wikimedia-jessie` which got removed in July 2020. I have missed the deletion of the image until docker-p...
[10:15:15] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/662916 (owner: 10Elukey)
[10:15:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1157 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14248 and previous config saved to /var/cache/conftool/dbconfig/20210209-101556-marostegui.json
[10:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:02] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:16:51] <wikibugs>	 (03PS1) 10Marostegui: db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/662917 (https://phabricator.wikimedia.org/T258361)
[10:16:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.stop-cluster: avoid context managers to apply downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/662916 (owner: 10Elukey)
[10:17:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/662917 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[10:18:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff)
[10:19:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 75%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14249 and previous config saved to /var/cache/conftool/dbconfig/20210209-101911-root.json
[10:19:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:22] <wikibugs>	 (03Merged) 10jenkins-bot: tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro)
[10:19:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1019.eqiad.wmnet
[10:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 75%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14250 and previous config saved to /var/cache/conftool/dbconfig/20210209-101939-root.json
[10:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:13] <wikibugs>	 10SRE, 10Data-Persistence-Backup: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo)
[10:21:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1157 for the first time in s3 T258361', diff saved to https://phabricator.wikimedia.org/P14251 and previous config saved to /var/cache/conftool/dbconfig/20210209-102109-marostegui.json
[10:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:14] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:23:00] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10SRE-swift-storage: Beta cluster Swift backend instances are missing profile::swift::storage::rsync_limit_memory_percent (puppet fails) - https://phabricator.wikimedia.org/T274092 (10hashar) Puppet is all happy on both instances indeed. Thank you!
[10:23:35] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:27] <wikibugs>	 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10akosiaris) 05Open→03Resolved a:03akosiaris And with that, I think indeed we can close this task. Production has dropped jessie support for some time now and doesn't...
[10:24:29] <wikibugs>	 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10akosiaris)
[10:29:32] <wikibugs>	 (03CR) 10Marostegui: "This was actually: Enable notifications :)" [puppet] - 10https://gerrit.wikimedia.org/r/662917 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[10:31:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - 10https://gerrit.wikimedia.org/r/662918
[10:31:19] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:34:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 100%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14252 and previous config saved to /var/cache/conftool/dbconfig/20210209-103414-root.json
[10:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 100%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14253 and previous config saved to /var/cache/conftool/dbconfig/20210209-103443-root.json
[10:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:01] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:35:24] <wikibugs>	 (PS2) David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791)
[10:36:44] <wikibugs>	 (CR) JMeybohm: [C: -1] Add support for php deployments (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/651757 (owner: Giuseppe Lavagetto)
[10:39:08] <wikibugs>	 (CR) David Caro: ceph.osd: Allow setting the io scheduler of the osd disks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: David Caro)
[10:40:28] <wikibugs>	 (PS3) David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791)
[10:40:30] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2001.codfw.wmnet
[10:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:58] <wikibugs>	 (PS4) David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791)
[10:41:37] <vgutierrez>	 !log rolling restart of esams LVS instances to catch up on kernel upgrades
[10:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:07] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3007.esams.wmnet
[10:43:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:33] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:08] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Vgutierrez) p:Triage→Medium
[10:48:11] <wikibugs>	 (PS1) DCausse: Add extra-analysis-khmer [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203)
[10:48:50] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3007.esams.wmnet
[10:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:39] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:00] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3006.esams.wmnet
[10:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 2%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14254 and previous config saved to /var/cache/conftool/dbconfig/20210209-105109-root.json
[10:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:44] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:53:08] <wikibugs>	 (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: Volans)
[10:53:17] <wikibugs>	 (PS2) Volans: git: exclude black refactor from git blame [software/spicerack] - https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750)
[10:53:21] <wikibugs>	 (PS1) Jbond: sysctl: Allow Boolean values [puppet] - https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175)
[10:53:24] <wikibugs>	 (CR) Volans: [C: +2] git: exclude black refactor from git blame [software/spicerack] - https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: Volans)
[10:53:49] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:53:55] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2001.codfw.wmnet
[10:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:35] <icinga-wm>	 PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[10:54:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - https://gerrit.wikimedia.org/r/662918 (owner: Muehlenhoff)
[10:55:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] sysctl: Allow Boolean values [puppet] - https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175) (owner: Jbond)
[10:55:14] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3006.esams.wmnet
[10:55:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:12] <wikibugs>	 (PS2) Jbond: sysctl: Allow Boolean values [puppet] - https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175)
[10:57:37] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still
[10:57:37] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still
[10:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:48] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) (owner: David Caro)
[11:02:25] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3005.esams.wmnet
[11:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:37] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Majavah) Phabricator doesn't also seem to be caching user profile pictures at all, which are also stored in the `phabricator_file` database.
[11:02:41] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:03:08] <volans>	 XioNoX: ^^
[11:04:42] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: David Caro)
[11:05:17] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:40] <wikibugs>	 (CR) David Caro: [C: +2] remote: allow prepending every command with sudo [software/spicerack] - https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) (owner: David Caro)
[11:05:57] <vgutierrez>	 volans: that's triggered by me
[11:06:01] <volans>	 ack
[11:06:06] <vgutierrez>	 and expected (lvs3005 being restarted)
[11:06:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 3%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14255 and previous config saved to /var/cache/conftool/dbconfig/20210209-110613-root.json
[11:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:04] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3005.esams.wmnet
[11:07:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:13] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:15] <vgutierrez>	 :)
[11:11:55] <wikibugs>	 (CR) David Caro: [C: +2] ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: David Caro)
[11:17:39] <vgutierrez>	 !log rolling restart of eqiad LVS instances to catch up on kernel upgrades
[11:17:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:12] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet
[11:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 4%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14256 and previous config saved to /var/cache/conftool/dbconfig/20210209-112116-root.json
[11:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:42] <wikibugs>	 (PS12) David Caro: remote: allow prepending every command with sudo [software/spicerack] - https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412)
[11:23:40] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet
[11:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:48] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Remove obsolete netfilter sysctl setting [puppet] - https://gerrit.wikimedia.org/r/662908 (owner: Muehlenhoff)
[11:27:56] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet
[11:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:00] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:28:28] <vgutierrez>	 ^^ that's me again, and it's expected
[11:28:50] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.012 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[11:30:56] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:32:37] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet
[11:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:37] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop analytics cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[11:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:37] <wikibugs>	 (CR) Urbanecm: [C: -1] "see https://phabricator.wikimedia.org/T274137#6812206" [mediawiki-config] - https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) (owner: Base)
[11:34:38] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:46] <elukey>	 !log start the upgrade process for Hadoop Analytics
[11:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:10] <wikibugs>	 (PS1) Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175)
[11:36:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14257 and previous config saved to /var/cache/conftool/dbconfig/20210209-113620-root.json
[11:36:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:38] <wikibugs>	 SRE, Traffic, User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (jbond) I have created a starting point [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/662932 | profile ]] using mostly based on `cacheproxy::performance ` below ill...
[11:37:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175) (owner: Jbond)
[11:39:04] <wikibugs>	 (CR) Jbond: [C: +2] sysctl: Allow Boolean values [puppet] - https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175) (owner: Jbond)
[11:40:26] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet
[11:40:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:35] <wikibugs>	 (PS3) David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412)
[11:41:48] <wikibugs>	 (PS2) Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175)
[11:42:20] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:43:22] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175) (owner: Jbond)
[11:44:24] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:46:07] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet
[11:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:50] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:20] <wikibugs>	 (PS1) Jbond: sysctl: reject undef values [puppet] - https://gerrit.wikimedia.org/r/662933 (https://phabricator.wikimedia.org/T274230)
[11:50:02] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1001.eqiad.wmnet
[11:50:04] <wikibugs>	 (PS3) Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230)
[11:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:52] <wikibugs>	 (CR) Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27926" [puppet] - https://gerrit.wikimedia.org/r/662933 (https://phabricator.wikimedia.org/T274230) (owner: Jbond)
[11:50:55] <wikibugs>	 (CR) David Caro: [C: -1] toolforge.etcdctl: add new etcdctl module (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: David Caro)
[11:51:04] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet
[11:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 8%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14258 and previous config saved to /var/cache/conftool/dbconfig/20210209-115124-root.json
[11:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:27] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1006.eqiad.wmnet
[11:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) (owner: Jbond)
[11:51:50] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1007.eqiad.wmnet
[11:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:13] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1008.eqiad.wmnet
[11:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:53] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1010.eqiad.wmnet
[11:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] toolforge.etcdctl: add new etcdctl module [software/spicerack] - https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: David Caro)
[11:53:52] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:54:54] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:55:21] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet
[11:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:22] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[11:56:49] <wikibugs>	 (Abandoned) Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - https://gerrit.wikimedia.org/r/662901 (owner: Jbond)
[11:57:06] <wikibugs>	 SRE, ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (hnowlan) Thanks for the heads up @CDanis  - I've repooled. It appears there were some issues with the weights of other maps hosts that should have prevented this having an impact, I've rectified that now too.
[11:57:54] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2005.codfw.wmnet
[11:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2006.codfw.wmnet
[11:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:27] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2008.codfw.wmnet
[11:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:31] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2009.codfw.wmnet
[11:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:37] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2010.codfw.wmnet
[11:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1200)
[12:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[12:00:28] <Lucas_WMDE>	 yup, nothing to do
[12:00:35] <_joe_>	 the best kind of deploy
[12:00:38] <Urbanecm>	 hehe
[12:00:44] <_joe_>	 the one that doesn't happen
[12:00:51] <Urbanecm>	 I might deploy sth anyway
[12:00:56] <Urbanecm>	 Daimona: hi, around for some more security patches in AF?
[12:01:23] <_joe_>	 dang
[12:01:26] <_joe_>	 :)
[12:01:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27927/console" [puppet] - https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: Ottomata)
[12:02:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop analytics cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[12:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:12] <Urbanecm>	 sorry for disturbing the best kind of a deploy _joe_ :)
[12:03:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/compiler1001/27927/mw1331.eqiad.wmnet/fulldiff.html shows this applies cleanly on the appservers." [puppet] - https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: Ottomata)
[12:05:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[12:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14259 and previous config saved to /var/cache/conftool/dbconfig/20210209-120627-root.json
[12:06:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:46] <wikibugs>	 (CR) Muehlenhoff: "Looks good, one comment inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[12:09:30] <Daimona>	 Urbanecm: hey, in 30 minutes probably
[12:09:44] <Urbanecm>	 Daimona: perfect! PM me once around :)
[12:11:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "I think there is a minor flaw in the implementation (basically, we won't check for new hiera() calls at node scope). Otherwise LGTM, quite" (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[12:15:47] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "LGTM. Do you want to schedule a deployment time?" [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[12:19:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] tegola: Add docker image. (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: Hnowlan)
[12:21:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 13%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14260 and previous config saved to /var/cache/conftool/dbconfig/20210209-122131-root.json
[12:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:01] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:19] <wikibugs>	 SRE, SRE-tools, tox-wikimedia, Patch-For-Review, User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (jbond) FYI i had to [[ https://github.com/psf/black/pull/1545 | apply a fix ]] to get the following black vim plugin setting to work `let g:black_s...
[12:36:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 15%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14261 and previous config saved to /var/cache/conftool/dbconfig/20210209-123634-root.json
[12:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:38] <wikibugs>	 (PS18) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[12:37:52] <wikibugs>	 (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[12:38:21] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[12:43:39] <wikibugs>	 (PS3) Jbond: (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626
[12:47:22] <wikibugs>	 (PS19) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[12:51:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 20%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14262 and previous config saved to /var/cache/conftool/dbconfig/20210209-125138-root.json
[12:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626 (owner: Jbond)
[13:00:41] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:57] <wikibugs>	 (PS4) Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[13:02:10] <wikibugs>	 (CR) Jbond: Add check to error when calling to hiera() (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[13:02:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[13:03:55] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:00] <wikibugs>	 (PS2) Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137)
[13:04:16] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[13:04:33] <wikibugs>	 (PS3) Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137)
[13:06:24] <wikibugs>	 Puppet, SRE, puppet-compiler, Patch-For-Review, User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (jbond)
[13:06:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14263 and previous config saved to /var/cache/conftool/dbconfig/20210209-130641-root.json
[13:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:20] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: front proxy: drop non-TLS support [puppet] - https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123)
[13:08:55] <twentyafterfour>	 !log restart phabricator daemons to free 3.5gb of ram (memory leak?)
[13:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] toolforge: front proxy: drop non-TLS support [puppet] - https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123) (owner: Arturo Borrero Gonzalez)
[13:10:51] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: dynamicproxy: nginx.conf: drop duplicated nginx directives [puppet] - https://gerrit.wikimedia.org/r/662944
[13:11:09] <wikibugs>	 (PS1) Muehlenhoff: Initial client profile for unprivileged Cumin [puppet] - https://gerrit.wikimedia.org/r/662945
[13:13:35] <wikibugs>	 (PS1) Kormat: tox: Add py3 env that uses default system python3 [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/662966
[13:13:38] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: toolforge: front proxy: drop non-TLS support [puppet] - https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123)
[13:14:19] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/662967 (https://phabricator.wikimedia.org/T204031)
[13:15:30] <wikibugs>	 Puppet, SRE, puppet-compiler, Patch-For-Review, User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (jbond) >>! In T265138#6813427, @Ladsgroup wrote: > @jbond: Hey, you wrote this as a checkbox >> []  migrate all cron types to systemd::timer::job...
[13:16:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] tox: Add py3 env that uses default system python3 [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/662966 (owner: Kormat)
[13:16:46] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb-backups: Remove old scheduled job disabling [puppet] - https://gerrit.wikimedia.org/r/644861 (owner: Jcrespo)
[13:17:26] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[13:20:07] <wikibugs>	 (CR) Matthias Mullie: Add external entity search URI for new MediaSearch extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: Anne Tomasevich)
[13:21:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 30%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14264 and previous config saved to /var/cache/conftool/dbconfig/20210209-132145-root.json
[13:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:01] <wikibugs>	 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond)
[13:22:38] <wikibugs>	 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) FYI i have tried to clarify things in the task description
[13:23:03] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sysctl: reject undef values [puppet] - 10https://gerrit.wikimedia.org/r/662933 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond)
[13:24:07] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[13:25:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[13:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:33] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:16] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi)
[13:27:29] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[13:30:55] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[13:31:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi)
[13:31:42] <wikibugs>	 (03PS2) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904)
[13:33:11] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:33:39] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:34:58] <elukey>	 added some downtime
[13:36:15] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:36:29] <wikibugs>	 (03PS1) 10ArielGlenn: WANObjectCache: throw on Closure [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662953 (https://phabricator.wikimedia.org/T273242)
[13:36:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 40%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14265 and previous config saved to /var/cache/conftool/dbconfig/20210209-133648-root.json
[13:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:43] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] WANObjectCache: throw on Closure [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662953 (https://phabricator.wikimedia.org/T273242) (owner: 10ArielGlenn)
[13:38:27] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:38:41] <wikibugs>	 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) @Zbyszko Thanks for looking into this its really appreciated  > we have similiar issues with it in WDQS streaming updater is `it` refering to Kyro he...
[13:40:53] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:07] * Urbanecm secdeploying
[13:47:51] <wikibugs>	 (03PS5) 10Jbond: CAS style changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734
[13:49:03] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 4:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond)
[13:49:31] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): wikidata: add Dagbani to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662970 (https://phabricator.wikimedia.org/T272242)
[13:51:21] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:33] <icinga-wm>	 RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14266 and previous config saved to /var/cache/conftool/dbconfig/20210209-135152-root.json
[13:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:54] <Urbanecm>	 !log Deploy security patch (T274152)
[13:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:05] <jouncebot>	 hashar and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1400).
[14:02:31] <wikibugs>	 (03Merged) 10jenkins-bot: WANObjectCache: throw on Closure [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662953 (https://phabricator.wikimedia.org/T273242) (owner: 10ArielGlenn)
[14:03:51] <icinga-wm>	 PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[14:05:44] <elukey>	 downtimed --^
[14:06:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 60%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14267 and previous config saved to /var/cache/conftool/dbconfig/20210209-140655-root.json
[14:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:25] <hashar>	 some 1.36.0-wmf.29 backport got merged
[14:07:33] <hashar>	 I am refreshing the deployment server
[14:07:37] <hashar>	 and will promote group 0 then group 1
[14:07:41] <wikibugs>	 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Zbyszko) >>! In T273867#6814932, @jbond wrote: > @Zbyszko Thanks for looking into this its really appreciated >  >> we have similiar issues with it in WDQS...
[14:08:38] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[14:09:32] <wikibugs>	 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) >>! In T273867#6815032, @Zbyszko wrote: > I'd try to reorganize register() methods on kryo - so that new ones are after the old ones. I'm not deep en...
[14:09:34] <hashar>	 apergos: I am syncing the patch that got merged
[14:09:42] <apergos>	 ok great
[14:09:46] <hashar>	 ohh
[14:09:54] <hashar>	 forgot about Urbanecm still running patches damn
[14:10:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Add dummy keytab for sretest1001 [labs/private] - 10https://gerrit.wikimedia.org/r/662972
[14:10:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez)
[14:10:26] <Urbanecm>	 hashar: oh, sorry, I'm done with sec patches now
[14:10:28] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.36.0-wmf.29/includes/libs/objectcache/wancache/WANObjectCache.php: WANObjectCache: throw on Closure - T273242 (duration: 01m 08s)
[14:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:34] <stashbot>	 T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242
[14:10:38] <Urbanecm>	 (since my !log statement)
[14:10:56] <hashar>	 Urbanecm: I should have held off and verified the backport window had completed, sorry about that
[14:11:15] <wikibugs>	 (03PS1) 10Volans: icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951)
[14:11:26] <wikibugs>	 (03CR) 10David Caro: "Did a quick review, looks nice, got some questions though. You can safely ignore the nits :)" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[14:11:37] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "Thank you!  Merging." [puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata)
[14:11:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans)
[14:13:07] <apergos>	 my attention is unfortunately split between here and a meeting I am facilitating, but I will absolutely respond to pings here
[14:13:51] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans)
[14:14:11] <wikibugs>	 (03CR) 10Ottomata: "Applied and tested on mw1331, works fine." [puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata)
[14:14:36] <gehel>	 !log depooling wdqs1005, catching up on lag
[14:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:42] <gehel>	 ryankemper: ^^
[14:15:24] <wikibugs>	 (03PS2) 10Volans: icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951)
[14:15:26] <wikibugs>	 (03PS1) 10Volans: Fix tox invocation [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662974
[14:15:55] <hashar>	 apergos: don't worry, I will check logstash and roll back as needed :]
[14:16:12] <hashar>	 apergos: at least I know you are watching more or less which by itself is comforting!
[14:16:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:14] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Fix tox invocation [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662974 (owner: 10Volans)
[14:16:17] <apergos>	 thank you (but I'm happy to know about it in real time too)
[14:17:16] <wikibugs>	 (03Merged) 10jenkins-bot: Fix tox invocation [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662974 (owner: 10Volans)
[14:17:18] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662975
[14:17:20] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662975 (owner: 10Hashar)
[14:17:25] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[14:17:35] <Majavah>	 I'm also around if I can be helpful in any way
[14:17:42] <hashar>	 \o/
[14:17:56] <hashar>	 Majavah: I am happy to see your patch got blessed :]
[14:18:13] <hashar>	 it is not like I understand anything about the issue beside something somehow triggering an attempt to serialize a closure
[14:18:15] <hashar>	 :-\
[14:18:38] <Majavah>	 we're still not sure if that was even the issue
[14:19:01] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662975 (owner: 10Hashar)
[14:19:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez)
[14:19:33] <Majavah>	 d.uesen was yesterday thinking about not merging it yet to see if the new logging worked
[14:20:04] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Train upgrade [puppet] - 10https://gerrit.wikimedia.org/r/662978 (https://phabricator.wikimedia.org/T261135)
[14:20:19] <apergos>	 no we're not
[14:20:27] <apergos>	 so we're throwing various things at it hoping that
[14:20:38] <apergos>	 it's either fixed or we get better logging enabling us to find the issue
[14:20:39] <hashar>	 apaches are syncing
[14:20:42] <apergos>	 ok
[14:21:19] <wikibugs>	 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Zbyszko) >>! In T273867#6815035, @jbond wrote: >>>! In T273867#6815032, @Zbyszko wrote: >> I'd try to reorganize register() methods on kryo - so that new on...
[14:21:48] <hashar>	 fpm restarting
[14:21:49] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.29
[14:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14268 and previous config saved to /var/cache/conftool/dbconfig/20210209-142159-root.json
[14:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans)
[14:22:21] <wikibugs>	 (03CR) 10Volans: [C: 03+2] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans)
[14:22:39] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[14:22:52] <wikibugs>	 (03Merged) 10jenkins-bot: icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans)
[14:23:31] <apergos>	 so that's just group0
[14:23:31] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps: move eqiad1 from openstack 'stein' to 'train' [puppet] - 10https://gerrit.wikimedia.org/r/662980 (https://phabricator.wikimedia.org/T261135)
[14:23:58] <apergos>	 is there a plan to go also to group 1 later in the hour, as the email mentioned to group1 also?
[14:24:15] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez)
[14:24:35] <hashar>	 apergos: yeah i will do it right now
[14:24:45] <hashar>	 cause the group0 logs seem fine
[14:24:55] <apergos>	 ok, I expected them to be quiet though
[14:25:23] <hashar>	 at least the logs are quiet
[14:25:26] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytab for sretest1001 [labs/private] - 10https://gerrit.wikimedia.org/r/662972 (owner: 10Muehlenhoff)
[14:25:27] <hashar>	 beside a bunch of known issues
[14:25:42] <hashar>	 doing group1
[14:26:04] <wikibugs>	 (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662981
[14:26:06] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662981 (owner: 10Hashar)
[14:26:28] <volans>	 !log cd /srv/external-monitoring; git fetch/status/pull on wikitech-static - T273951
[14:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:32] <stashbot>	 T273951: Update Icinga meta-monitoring to account for "no pagers" in contacts - https://phabricator.wikimedia.org/T273951
[14:26:58] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662981 (owner: 10Hashar)
[14:27:21] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:47] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.29
[14:28:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:29] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:29:31] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:29:54] <logmsgbot>	 !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.29 (duration: 01m 06s)
[14:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:01] <hashar>	 e/F/i/FeaturedFeedChannel:45  Call to a member function getCode() on string
[14:31:26] <Majavah>	 uhhhh
[14:32:48] <hashar>	 Majavah: the full trace https://phabricator.wikimedia.org/T264391#6815153
[14:32:57] <Majavah>	 that is definitely a coding error in my featuredfeeds patch
[14:33:01] <wikibugs>	 10SRE, 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10herron) @Papaul sure, sounds good.  This host is not yet in production so there will be no prep/depool needed before the re-rack.
[14:33:47] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[14:33:55] <hashar>	 then the feeds seem to work
[14:33:56] <Majavah>	 please rollback
[14:34:00] <wikibugs>	 (03PS1) 10Gehel: wait_reboot_since() is now using a constant backoff. [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983
[14:34:05] <Majavah>	 multilingual feeds do not, see https://commons.wikimedia.org/w/api.php?action=featuredfeed&feed=potd&feedformat=rss&language=en for example
[14:35:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez)
[14:37:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 85%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14269 and previous config saved to /var/cache/conftool/dbconfig/20210209-143703-root.json
[14:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:07] <wikibugs>	 (03PS1) 10Hashar: Revert "group1 wikis to 1.36.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662984 (https://phabricator.wikimedia.org/T271343)
[14:37:09] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.36.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662984 (https://phabricator.wikimedia.org/T271343) (owner: 10Hashar)
[14:37:27] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:37:28] <hashar>	 Majavah: and yeah that is rolling back
[14:37:37] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.36.0-wmf.29"
[14:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:13] <Majavah>	 thank you
[14:38:24] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.36.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662984 (https://phabricator.wikimedia.org/T271343) (owner: 10Hashar)
[14:39:17] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, in fact given that the code inside the retry is cheap is actually nicer to have it more reactive and check constantly every N second" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel)
[14:39:26] <apergos>	 options: revert the one patch in the branch and roll back out; or daniel might be able to work up a fix very fast and we add that, merge, backport, merge, and roll that out
[14:39:34] <apergos>	 hashar (or others): opinions?
[14:39:52] <hashar>	 I am guessing patching up the getCode() failure might be trivial enough?
[14:39:59] <Majavah>	 I think I also have another issue with FeaturedFeeds
[14:40:07] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[14:40:07] <hashar>	 cause without the patch the issue will still show up anyway
[14:40:52] <Majavah>	 I already have a fix for this specific issue but also found another in the process
[14:40:59] <apergos>	 oh?
[14:41:18] <Majavah>	 yeah, it breaks if you attempt to use Special:FeedItem with a date that does not have a feed item for that date
[14:41:49] <apergos>	 is this a bug that existed before the cache issues showed up?
[14:41:58] <Majavah>	 no, new thing :/
[14:42:02] <apergos>	 ugh, ok
[14:42:13] <apergos>	 how easy is that going to be to fix?
[14:42:41] <Majavah>	 fairly simple, but I'm more worried that other edge cases might have gotten around code review too
[14:43:18] <gehel>	 !log rebooting wdqs1009 / 1010 for kernel upgrade
[14:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:36] <apergos>	 right :-(
[14:44:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wait_reboot_since() is now using a constant backoff. [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel)
[14:45:38] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel)
[14:45:46] <wikibugs>	 (03PS20) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[14:46:05] <apergos>	 so, do we want to revert the one patch or do you want to try to fix it and get a careful code review etc?
[14:46:21] <Majavah>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662985/ fixes those specific issues
[14:46:37] <Majavah>	 some people wanted to not have that patch in this brain at all to see if the logging worked
[14:46:45] <apergos>	 right
[14:46:56] <apergos>	 let me discuss it right now in cpt
[14:46:58] <wikibugs>	 (03CR) 10Jbond: "thanks updated" (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[14:48:42] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host pybal-test2001.codfw.wmnet
[14:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:44] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2001.codfw.wmnet
[14:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:06] <wikibugs>	 (03PS21) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[14:51:08] <wikibugs>	 (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[14:52:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14270 and previous config saved to /var/cache/conftool/dbconfig/20210209-145206-root.json
[14:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:07] <apergos>	 ok, after having talked it over with a few people, Daniel and I are taking the responsibility for the decision, we would like the feeds patch to come out of the branch for now,
[14:54:15] <apergos>	 roll forward the train with just the logging fixes
[14:54:35] <apergos>	 and we will try to get the feeds fix patch plus the new patch out tomorrow 
[14:54:41] <apergos>	 hashar, does this work for you?
[14:54:59] <apergos>	 who can revert the feeds fix patch and roll the branch back out to groups 0/1?
[14:55:09] <Majavah>	 so revert the featuredfeeds caching patch for .29?
[14:55:13] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm)
[14:55:18] <wikibugs>	 (03CR) 10Jason Linehan: [C: 03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway)
[14:55:19] <apergos>	 yes, assuming that hashar agrees
[14:55:21] <hashar>	 I don't understand anything about the issue at hand
[14:55:27] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:30] <hashar>	 and my mediawiki knowledge is close to zero nowadays 
[14:55:31] <hashar>	 so 
[14:55:35] <wikibugs>	 (03PS1) 10Majavah: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662956
[14:55:37] <hashar>	 I cant really make any decision ;]
[14:55:58] <hashar>	 the idea is to drop the recent .29 backport we did like an hour ago
[14:56:04] <hashar>	 which would let the error occur again
[14:56:09] <hashar>	 but this time with better logging?
[14:56:11] <apergos>	 yes
[14:56:18] <hashar>	 so we are still blocked
[14:56:19] <hashar>	 but
[14:56:22] <hashar>	 have some better logging ;)
[14:56:25] <apergos>	 yes
[14:56:25] <hashar>	 is that correct?
[14:56:44] <apergos>	 and in the meantime: look at the additional feeds fix,
[14:56:57] <apergos>	 get it ready to go it,
[14:57:00] <apergos>	 *go in,
[14:57:13] <apergos>	 and use logging output to cross-check the issue
[14:57:21] <hashar>	 so we rollback FeaturedFeeds patch: * 8fc0f13 - (HEAD, origin/wmf/1.36.0-wmf.29) Caching fixes (14 hours ago) <Taavi Väänänen>
[14:57:32] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: upgrading openstack
[14:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:35] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrading openstack
[14:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:39] <apergos>	 yes, Majavah has already put the revert patch in
[14:57:54] <apergos>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662956/
[14:58:07] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:33] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "Causes:" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662956 (owner: 10Majavah)
[14:58:46] <hashar>	 +2 ed
[14:59:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Train upgrade [puppet] - 10https://gerrit.wikimedia.org/r/662978 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[15:00:32] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] wait_reboot_since() is now using a constant backoff. [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel)
[15:00:55] <apergos>	 hashar thanks for the +2, I have a meeting, in 30 mins I will be here again once the revert has merged through
[15:01:09] <apergos>	 daniel will be here hopefully in 30 also, but maybe a little later
[15:01:12] <hashar>	 will deploy and promote again
[15:01:18] <apergos>	 ping me if anything needed
[15:01:21] <wikibugs>	 (03PS1) 10Gehel: wdqs: explicit shutdown of Blazegraph during reboots. [cookbooks] - 10https://gerrit.wikimedia.org/r/662988
[15:01:22] <hashar>	 and capture whatever log is appearing ;)
[15:01:25] <apergos>	 ok!
[15:02:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: move eqiad1 from openstack 'stein' to 'train' [puppet] - 10https://gerrit.wikimedia.org/r/662980 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[15:03:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Add support for php deployments (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto)
[15:03:28] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662956 (owner: 10Majavah)
[15:03:35] <wikibugs>	 (03PS9) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139)
[15:05:20] <hashar>	 Majavah: apergos: deploying the revert  and I will promote group1 again
[15:05:29] <hashar>	 so we should get enhanced error logs 
[15:05:47] <wikibugs>	 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) Thanks i sent you response upstream will see what they come back with https://groups.google.com/u/1/a/apereo.org/g/cas-user/c/MkpgAZZn-Mw
[15:06:32] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/FeaturedFeeds: Revert "Caching fixes" T264391 (duration: 01m 25s)
[15:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:36] <stashbot>	 T264391: FeaturedFeedChannel must not contain a User object, since it cannot be serialized safely. - https://phabricator.wikimedia.org/T264391
[15:07:11] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/662988 (owner: 10Gehel)
[15:07:13] <wikibugs>	 (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662990
[15:07:15] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662990 (owner: 10Hashar)
[15:08:11] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662990 (owner: 10Hashar)
[15:10:05] <Majavah>	 verified that the feed commons potd feed is working now
[15:10:11] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.29
[15:10:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:53] <moritzm>	 !log readding ganeti5002 to the eqsin Ganeti cluster following mainboard replacement/reinstall T261130
[15:10:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:57] <stashbot>	 T261130: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130
[15:11:07] <hashar>	 it is deploying
[15:11:23] <logmsgbot>	 !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.29 (duration: 01m 11s)
[15:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:32] <hashar>	 [YCKk3ApAICkAABTZCOAAAACA] /wiki/MediaWiki   ArgumentCountError: Too few arguments to function FeaturedFeedChannel::__construct(), 3 passed in /srv/mediawiki/php-1.36.0-wmf.29/extensions/FeaturedFeeds/includes/FeaturedFeeds.php on line 214 and exactly 4 expected
[15:13:01] <Majavah>	 that's with the patch reverted?
[15:13:54] <hashar>	 should have been reverted yeah
[15:14:21] <apergos>	 uh wut
[15:14:48] <hashar>	 maybe that one is a one off error though
[15:14:55] <Majavah>	 FeaturedFeedChannel constructor has 3 parameters on master and .29, no idea how that's happening
[15:15:08] <hashar>	 yeah that happened only once
[15:15:08] <Majavah>	 you have the url for that?
[15:15:24] <wikibugs>	 (03CR) 10Mholloway: [C: 04-2] "Hold until 1.36.0-wmf.30 is live on all wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway)
[15:15:54] <papaul>	 !log power down mw2220  for maintenance 
[15:15:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:50] <Lucas_WMDE>	 constructor looks like this on mw1320:
[15:16:53] <Lucas_WMDE>	 public function __construct( $name, $options, $lang, User $user ) {
[15:16:56] <hashar>	 pasting to the task
[15:17:00] <Lucas_WMDE>	 (/srv/mediawiki/php-1.36.0-wmf.29/extensions/FeaturedFeeds/includes/FeaturedFeedChannel.php)
[15:18:29] <hashar>	 https://phabricator.wikimedia.org/T264391#6815291
[15:18:41] <Majavah>	 and I somehow failed git pulling
[15:18:48] <Majavah>	 that four argument constructor is correct and wanted
[15:19:11] <hashar>	 it might just have been a off by one transient error
[15:19:20] <wikibugs>	 (03PS6) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294
[15:19:21] <Majavah>	 the only caller should be using four
[15:19:40] <Lucas_WMDE>	 yeah,  FeaturedFeeds.php calls it with four arguments as far as I can see (still on mw1320)
[15:19:47] <Lucas_WMDE>	 might have happened halfway through a sync?
[15:19:55] <hashar>	 yeah I think
[15:20:03] <hashar>	 cause our deploys are definitely not atomic
[15:20:10] <Majavah>	 let's hope that
[15:20:17] <Lucas_WMDE>	 ok
[15:20:17] <hashar>	 and it was a single error
[15:21:40] <hashar>	 so with .29 promoted to group 1 there are no new logs showing up
[15:21:45] <hashar>	 that is a good sign I guess
[15:21:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 95 hosts with reason: upgrading openstack
[15:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:06] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: upgrading openstack
[15:22:08] <Majavah>	 anything about the original issue we're trying to solve?
[15:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:10] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrading openstack
[15:22:11] <hashar>	 though iirc the issue was appearing on enwiki (which is not in group 1)
[15:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:22] <hashar>	 Majavah: nop. It is all quiet
[15:22:22] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 95 hosts with reason: upgrading openstack
[15:22:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:30] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 95 hosts with reason: upgrading openstack
[15:22:32] <Majavah>	 scary
[15:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:04] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 95 hosts with reason: upgrading openstack
[15:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:15] <hashar>	 maybe the issue is related to some cache expiry
[15:23:23] <hashar>	 and would only surface after some TTL expired
[15:23:34] <hashar>	 it is not like I know anything about what is going on though
[15:23:53] <apergos>	 I need to see which wikis had the error besides enwiki (later)
[15:24:04] <apergos>	 there were a total of 45 errors iirc, not a lot
[15:24:10] <wikibugs>	 (03CR) 10Kormat: "Fails because jenkins is currently using a debian stretch base image for this repo." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 (owner: 10Kormat)
[15:24:10] <apergos>	 waiting and seeing sounds good
[15:24:38] <Majavah>	 FeaturedFeed might cache things for up to 24h
[15:24:43] <hashar>	 enwiki was one for sure
[15:24:53] <hashar>	 I can't find whether commonswiki was affected as well
[15:25:13] <hashar>	 I am going to catch my kids at school
[15:25:14] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "Getting it in, thanks for the understanding 😊" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[15:25:21] <hashar>	 will be back in ~ half an hour
[15:25:26] <hashar>	 but things seem stable
[15:25:41] <hashar>	 then I guess we can look at updating the remaining wikis 
[15:27:55] <hashar>	 be back in a few
[15:29:17] <apergos>	 ok, I will be here more or less until then (and through then)
[15:29:39] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:27] <papaul>	 !log power down logstash2035 for relocation 
[15:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:48] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Had a meeting with Adam and can confirm it's really coming from him." [puppet] - 10https://gerrit.wikimedia.org/r/662661 (owner: 10Awight)
[15:33:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[15:34:23] <wikibugs>	 (03PS1) 10Jbond: sso cloud: add wmfcloud.org service registration [puppet] - 10https://gerrit.wikimedia.org/r/662995
[15:35:09] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sso cloud: add wmfcloud.org service registration [puppet] - 10https://gerrit.wikimedia.org/r/662995 (owner: 10Jbond)
[15:35:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[15:35:38] <wikibugs>	 (03PS1) 10Herron: icinga: send fr-tech-ops alerts to victorops-fundraising [puppet] - 10https://gerrit.wikimedia.org/r/662996 (https://phabricator.wikimedia.org/T273065)
[15:37:19] <wikibugs>	 (03CR) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[15:37:22] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[15:38:33] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[15:38:59] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[15:39:21] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[15:39:32] <wikibugs>	 (03PS1) 10Andrew Bogott: Train/buster: don't install python-pycadf [puppet] - 10https://gerrit.wikimedia.org/r/662997 (https://phabricator.wikimedia.org/T261135)
[15:39:58] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Train upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/662957
[15:40:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Train/buster: don't install python-pycadf [puppet] - 10https://gerrit.wikimedia.org/r/662997 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[15:40:31] <icinga-wm>	 PROBLEM - Host logstash2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:12] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] icinga: send fr-tech-ops alerts to victorops-fundraising [puppet] - 10https://gerrit.wikimedia.org/r/662996 (https://phabricator.wikimedia.org/T273065) (owner: 10Herron)
[15:42:00] <herron>	 logstash2035 is a planned re-rack jftr, will downtime now
[15:44:02] <apergos>	 back in 2 minutes (cat feeding)
[15:44:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro)
[15:46:37] <icinga-wm>	 RECOVERY - Host logstash2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.53 ms
[15:47:16] <wikibugs>	 10SRE, 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10Papaul) 05Open→03Resolved @herron this is complete the server is back up and Netbox update.
[15:48:30] <wikibugs>	 10SRE, 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10herron) LGTM thanks @Papaul!
[15:48:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup), 10User-fgiunchedi: ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving, the other two items should be either done or tackled elsewhere...
[15:53:03] <apergos>	 Majavah: would you be willing to write tests for the new patch? 
[15:53:58] <icinga-wm>	 RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[15:54:33] <wikibugs>	 (03CR) 10Bstorm: "If we are at the point of directly patching the code with something we'd never even consider pushing upstream (like in Neutron), I think w" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[15:54:39] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul)
[15:58:24] <hashar>	 more or less back. Multitasking with my kids' homework
[15:58:34] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) 05Open→03Resolved @fgiunchedi I am resolving this but please open a decom task when you're ready to decommission this server.  Thanks
[15:59:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) thanks, @Bstorm this is on my list to do this week and will update you once...
[15:59:26] <logmsgbot>	 !log volker-e@deploy1001 Started deploy [design/style-guide@b9b7ee6]: Deploy design/style-guide: b9b7ee6 “Components”: Fix components overview SVG rendering glitch (#439)
[15:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:33] <logmsgbot>	 !log volker-e@deploy1001 Finished deploy [design/style-guide@b9b7ee6]: Deploy design/style-guide: b9b7ee6 “Components”: Fix components overview SVG rendering glitch (#439) (duration: 00m 07s)
[15:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH)
[16:01:33] <wikibugs>	 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10AntiCompositeNumber) >>! In T273741#6814266, @Shizhao wrote: > Rename to a new filename?  The Commons community generally avoids moving files, a...
[16:01:56] <wikibugs>	 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Mvolz)
[16:02:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10RobH)
[16:02:36] <wikibugs>	 (03CR) 10Bstorm: "Do we have a diff from the packaged file somewhere or should I make one?" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[16:03:29] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10fgiunchedi) 05Resolved→03Open >>! In T272209#6815416, @Cmjohnson wrote: > @fgiunchedi I am resolving this but please open a decom task when you're ready to decommission this server.  Thanks  [reopening]  This host...
[16:03:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Looks sane to me but I'm not familiar with our k6s puppetization. Maybe Luca or John are more familiar?" [puppet] - 10https://gerrit.wikimedia.org/r/662945 (owner: 10Muehlenhoff)
[16:04:08] <wikibugs>	 (03CR) 10Andrew Bogott: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[16:04:54] <apergos>	 hashar: see anything in the logs?
[16:06:00] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Papaul)
[16:06:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Papaul) 05Open→03Resolved @Dzahn   Drained power  and upgrade IDRAC firmware from 2.30.30.30 to 2.63.  All looks good now
[16:07:22] <wikibugs>	 (03PS2) 10CRusnov: openldap/offboard-user.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364)
[16:07:38] <wikibugs>	 (03CR) 10CRusnov: openldap/offboard-user.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:09:08] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:09:47] <wikibugs>	 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) @fgiunchedi is it necessary to spend time fixing this issue since your plan is to decom this server?
[16:10:00] <wikibugs>	 (03CR) 10Andrew Bogott: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[16:11:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:11:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:13] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[16:13:01] <hashar>	 apergos: checking logs 
[16:13:17] <hashar>	 kids' homework done (after begging for random games instead of focusing on reading bah)
[16:13:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Cmjohnson) netbox updated with switch info and dns.
[16:13:27] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "I'll make a task to track winding down our use cases." [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[16:14:00] <godog>	 !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
[16:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:04] <stashbot>	 T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[16:14:25] <hashar>	 apergos: nothing related to the closure issue or pagefeeds
[16:14:37] <apergos>	 what are you using to search?
[16:18:04] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[16:20:03] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:18] <moritzm>	 !log installing wireshark security updates
[16:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:09] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:15] <wikibugs>	 (03PS22) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[16:22:29] <apergos>	 oh well, I have logstash open, filtering out some db stuff and allowing a bunch more, with wanobjectcache in the message, not really anything going on yet
[16:22:31] <wikibugs>	 (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[16:22:48] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "https://phabricator.wikimedia.org/T274268 so we can add structure to that process over time and figure out when we are really ready to sto" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[16:24:07] <wikibugs>	 (03PS7) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562)
[16:24:55] <hashar>	 apergos: yeah that sounds fine
[16:25:12] <wikibugs>	 (03PS1) 10Ottomata: Bump Hadoop datanode heap to 8G [puppet] - 10https://gerrit.wikimedia.org/r/663003 (https://phabricator.wikimedia.org/T273711)
[16:25:35] <apergos>	 hashar: was there a plan to roll out to group2 or to $some_wiki later? 
[16:25:44] <apergos>	 where $some_wiki is one we know to have had issues
[16:25:45] <hashar>	 I haven't made any plan
[16:25:50] <apergos>	 ok
[16:25:51] <hashar>	 I am fine pushing it nowish I guess
[16:26:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Bump Hadoop datanode heap to 8G [puppet] - 10https://gerrit.wikimedia.org/r/663003 (https://phabricator.wikimedia.org/T273711) (owner: 10Ottomata)
[16:26:02] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Bump Hadoop datanode heap to 8G [puppet] - 10https://gerrit.wikimedia.org/r/663003 (https://phabricator.wikimedia.org/T273711) (owner: 10Ottomata)
[16:26:06] <apergos>	 well lemme see if there is a smallish good candidate
[16:26:33] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:08] <hashar>	 I am looking for the past Closure issue messages
[16:27:30] <wikibugs>	 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10fgiunchedi) >>! In T273895#6815460, @Papaul wrote: > @fgiunchedi is it necessary to spend time fixing this issue since your plan is to decom this server?   I don't know if it is a real problem,...
[16:27:31] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741)
[16:27:56] <hashar>	 there were some on commonswiki
[16:28:10] <hashar>	 but bulk of them were on enwiki / zhwiki / jawiki which are not from group 1
[16:28:16] <hashar>	 so yeah I guess we need to promote everything
[16:28:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741)
[16:28:31] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul)
[16:29:12] <wikibugs>	 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) @fgiunchedi ok.
[16:29:25] <wikibugs>	 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) a:03Papaul
[16:29:35] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Temporarily enable read-only backups and disable rw backups of ES [puppet] - 10https://gerrit.wikimedia.org/r/663005 (https://phabricator.wikimedia.org/T79922)
[16:29:40] <hashar>	 guess we can promote now ?
[16:31:28] <wikibugs>	 (03CR) 10Tjones: "Do we have to worry about the debian-glue-non-voting failure?" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse)
[16:32:40] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Temporarily enable read-only backups and disable rw backups of ES [puppet] - 10https://gerrit.wikimedia.org/r/663005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo)
[16:32:59] <icinga-wm>	 PROBLEM - SSH on logstash1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:33:18] <apergos>	 if there were "some" on group1 I would uh
[16:33:35] <apergos>	 I'm having trouble getting logstash to show me some older examples >_<
[16:33:41] <hashar>	 ;D
[16:34:04] <hashar>	 you can use fixed date for Jan 29th
[16:34:29] <apergos>	 it keeps giving me little orange error boxes
[16:34:35] <icinga-wm>	 RECOVERY - SSH on logstash1022 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:34:39] <apergos>	 and then I can't even expand stack traces or add filters, it's maddening
[16:35:00] <wikibugs>	 (03CR) 10DCausse: "> Patch Set 1:" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse)
[16:36:55] <hashar>	 apergos: https://logstash.wikimedia.org/goto/325a206bdc1b1fc9178c4893762fa936
[16:37:06] <hashar>	 occurrences of message:Closure   for Jan 29
[16:37:07] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.154 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:37:12] <apergos>	 trying that, thanks
[16:37:13] <hashar>	 at some arbitrary time range
[16:37:22] <hashar>	 on mediawiki-errors dashboard
[16:38:01] <Majavah>	 uhh that logstash error doesn't sound good
[16:39:54] <hashar>	 https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&orgId=1&refresh=5m&from=now-24h&to=now
[16:40:31] <hashar>	 looks like it always has some business around that time of the day for some reason
[16:41:15] <wikibugs>	 (03PS1) 10Joal: Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009
[16:41:21] <joal>	 elukey, ottomata --^
[16:41:40] <wikibugs>	 (03PS2) 10Ottomata: Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009 (https://phabricator.wikimedia.org/T273711) (owner: 10Joal)
[16:42:45] <hashar>	 I am going to do the promote
[16:43:24] <apergos>	 yuck
[16:43:29] <wikibugs>	 (03PS1) 10Hashar: all wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663010
[16:43:32] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] all wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663010 (owner: 10Hashar)
[16:43:41] <apergos>	 maybe wait til logstash is happy again?
[16:43:50] <apergos>	 hashar: ^^
[16:44:57] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663010 (owner: 10Hashar)
[16:45:26] <hashar>	 apergos: I don't think it is much of an issue
[16:45:31] <apergos>	 ok
[16:45:38] <apergos>	 in that case roll em roll em roll em
[16:45:40] <hashar>	 seems to be oscillating between 5 and 8 indexing failures per second
[16:45:44] <hashar>	 based on our overall traffic
[16:45:54] <wikibugs>	 (03PS3) 10Joal: Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009
[16:46:06] <hashar>	 and since now is kind of peak hours (late in asia, prime in europe,  us is around),  I guess it is normal to have more events and more failures
[16:46:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009 (owner: 10Joal)
[16:47:09] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.2 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:47:15] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.29
[16:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:46] <hashar>	 the syslog for logstash does show there has been a peak of indexing errors around 16:46 ( https://logstash.wikimedia.org/goto/8c3d1c10bc7574b41793e2b61dc2f151 )
[16:49:03] <hashar>	 baseline ~ 800 events / 30 seconds.  Peak at 1600 for a 30-second bucket
[16:49:49] <godog>	 indeed, there's a bunch of open tasks to tackle the persisting indexing errors :(
[16:50:02] <hashar>	 seems it is due to some fields having mismatching types
[16:50:20] <wikibugs>	 (03PS5) 10Ejegg: Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308)
[16:50:34] <godog>	 that's correct yes
[16:50:52] <apergos>	 so we're on 29 everywhere
[16:50:56] <apergos>	 lemme get on the log host
[16:51:03] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:19] <hashar>	 and thank you so much for your patches
[16:56:23] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:33] <Urbanecm>	 yup, +\infty for those who do latex math :)
[16:56:58] <Majavah>	 no, it's 1900 local now, midnight UTC will be 2am local
[16:57:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/663011 (owner: 10David Caro)
[16:57:59] <icinga-wm>	 PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:58:04] <wikibugs>	 (03CR) 10David Caro: "Nice work! LGTM codewise but I'll leave for someone that knows the context to approve :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[16:58:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put into maintenance mode for Train upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/662957 (owner: 10Andrew Bogott)
[16:58:47] <icinga-wm>	 PROBLEM - tileratorui on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[16:58:51] <wikibugs>	 (03PS1) 10Hashar: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391)
[16:58:56] <Lucas_WMDE>	 Majavah: I’ll skip the good night but say good luck with the exam! :)
[16:59:02] <hashar>	 ^^ reconciliating wmf.30
[16:59:04] <hashar>	 which lacks the revert
[16:59:06] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" part 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/663011 (owner: 10David Caro)
[16:59:14] <Majavah>	 thank you Lucas_WMDE :D
[16:59:16] <Urbanecm>	 wdym hashar?
[16:59:21] <Urbanecm>	 and yeah, good luck with your exam :)
[16:59:23] <icinga-wm>	 PROBLEM - tilerator on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[16:59:30] <hashar>	 wdym ?
[16:59:39] <hashar>	 oh
[16:59:41] <Urbanecm>	 what do you mean by " reconciliating wmf.30"
[16:59:44] <wikibugs>	 (03PS2) 10Bstorm: wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044)
[16:59:52] <hashar>	 that is nonsense, sorry
[17:00:00] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[17:00:00] <Majavah>	 (i've also had exams every day since last Wednesday)
[17:00:02] <Urbanecm>	 :(
[17:00:04] <jouncebot>	 jbond42 and cdanis: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1700)
[17:00:24] <hashar>	 so in short we had a patch for FeaturedFeeds on wmf.29  but it was faulty and got reverted
[17:00:39] <hashar>	 however the faulty patch got included in the wmf.30 branch and thus has to be reverted
[17:00:43] <hashar>	 which is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662964 :)
[17:00:47] <Urbanecm>	 ah, makes sense
[17:01:29] <logmsgbot>	 !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[17:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:03] <wikibugs>	 (03CR) 10Hashar: "The change made it to master and thus got included in the wmf.30 branch cut.  We have cherry picked to wmf.29 and found out it was causing" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) (owner: 10Hashar)
[17:02:18] <hashar>	 meeting time
[17:02:32] <hashar>	 will watch logstash in parallel. Ping me if anything is needed
[17:02:42] <apergos>	 omg there are some log entries
[17:02:47] <apergos>	 now if only one has the key
[17:03:59] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:04:08] <wikibugs>	 (03CR) 10David Caro: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[17:04:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:28] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) Thank you @Papaul. Reimaging it now and things look normal.
[17:04:32] <wikibugs>	 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10epriestley) > We need an easy way to tell logged-out traffic apart in our caching layer.  Can your caching layer examine cookie values?  If the `phsid` cookie...
[17:05:48] <wikibugs>	 (03CR) 10Jbond: "This change is ready for review." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov)
[17:06:57] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:07:22] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet
[17:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:52] <apergos>	 Async refresh failed for frwiki:featured-feeds:1:fr          for people following along!
[17:07:56] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:13:10] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet
[17:13:11] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet
[17:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:02] <Majavah>	 what about doing a similar request but with a different language code to bypass the already-cached result and with mwdebug to hopefully get more detailed debug logging?
[17:14:31] <wikibugs>	 (03CR) 10Bstorm: "Adding Andrew as a reviewer just to poke for errors if you see any. I think this should add a log file per process and a logrotate." [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm)
[17:14:32] <Urbanecm>	 that...might reproduce it
[17:14:41] <Urbanecm>	 or simply purge the actual key :)
[17:14:50] <apergos>	 do it on another wiki?
[17:14:50] <apergos>	 um
[17:15:27] <Urbanecm>	 apergos: not all wikis have featuredfeeds enabled
[17:15:28] <apergos>	 ah there's only the one so far. 
[17:15:33] <apergos>	 let's see if we get another
[17:15:38] <apergos>	 I'd like to keep one intact
[17:15:44] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663016
[17:15:48] <Majavah>	 the page took ages to load on mwdebug1001
[17:15:58] <Majavah>	 rendered successfully, no errors visible to the user
[17:16:10] <apergos>	 Async refresh failed for jawiki:featured-feeds:1:ja
[17:16:20] <Majavah>	 can anyone look if that reproduced it? cache key should be in theory frwiki:featured-feeds:1:fi with an I instead of R
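The key Majavah describes can be sketched as below. The layout `<wiki>:featured-feeds:<version>:<language code>` is inferred only from the keys quoted in this log (`frwiki:featured-feeds:1:fr`, `jawiki:featured-feeds:1:ja`); the helper name `feed_cache_key` is hypothetical and is not the FeaturedFeeds implementation.

```python
def feed_cache_key(wiki: str, lang: str, version: int = 1) -> str:
    """Build a key in the layout seen in the log (an inferred layout, not the extension's code)."""
    return f"{wiki}:featured-feeds:{version}:{lang}"


# Requesting the same wiki with a different language code yields a different key,
# so a lookup would not hit whatever is already cached under the original key.
cached = feed_cache_key("frwiki", "fr")
bypass = feed_cache_key("frwiki", "fi")

print(cached, bypass, cached != bypass)
```

If the real key follows this layout, that is why a request with `fi` instead of `fr` would be expected to recompute the feed rather than reuse the cached result.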
[17:16:35] <apergos>	 um let's see
[17:17:10] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741)
[17:17:13] <apergos>	 don't see it in logstash yet
[17:17:19] <apergos>	 most recent is 3 minutes ago
[17:17:25] <apergos>	 the ja one.
[17:17:44] <wikibugs>	 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10ssingh) Update:  Thank you for the interest in this task! Like we shared yesterday, we have identified that the traffic is...
[17:18:26] <apergos>	 still no.
[17:20:02] <Urbanecm>	 let me try something...
[17:20:09] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.388 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[17:20:16] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet
[17:20:17] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet
[17:20:51] <apergos>	 oohhh there are some fa and he wiki ones now
[17:21:23] <_joe_>	 uhm indexing errors on logstash is not a good sign usually
[17:23:00] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:02] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:04] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:10] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:30] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:32] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:32] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:42] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:42] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:50] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:51] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:23:53] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2264.codfw.wmnet with reason: REIMAGE
[17:24:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 95 hosts with reason: upgrading openstack
[17:24:10] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:24:13] <wikibugs>	 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10fdans) @ssingh said it yesterday on chat but this is such stellar data detective work. Congrats on finding the culprit!!
[17:24:28] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:24:30] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:24:31] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:24:32] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:24:33] <andrewbogott>	 that's some downtime expiring, I'll renew
[17:24:43] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 95 hosts with reason: upgrading openstack
[17:25:28] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:25:30] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1020 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:25:30] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:25:31] <Urbanecm>	 why do i get `TypeError: Too few arguments to function FeaturedFeeds::getFeeds()`?
[17:25:35] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet
[17:25:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1383.eqiad.wmnet with reason: REIMAGE
[17:25:44] <Urbanecm>	 it says `exactly 2 expected`
[17:25:57] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2264.codfw.wmnet with reason: REIMAGE
[17:25:57] <wikibugs>	 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Majavah) Is the effect that the block will have in the app known?
[17:26:19] <Urbanecm>	 but that function...needs only one argument?
[17:26:21] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:23] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:25] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:25] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1025 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:31] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:39] <wikibugs>	 10SRE, 10Platform Engineering, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani)
[17:26:42] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:45] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:55] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:26:55] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:28:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1383.eqiad.wmnet with reason: REIMAGE
[17:28:24] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:28:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: varnish: fix escaping of variables in test run script [puppet] - 10https://gerrit.wikimedia.org/r/663017
[17:29:15] <wikibugs>	 10SRE, 10Scap, 10Release-Engineering-Team-TODO, 10Sustainability (Incident Followup), 10User-brennen: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10thcipriani)
[17:33:20] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2220.codfw.wmnet with reason: REIMAGE
[17:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1300.eqiad.wmnet with reason: REIMAGE
[17:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:24] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2220.codfw.wmnet with reason: REIMAGE
[17:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:28] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1300.eqiad.wmnet with reason: REIMAGE
[17:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:54] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:39:59] <wikibugs>	 (03PS29) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:40:55] <wikibugs>	 (03CR) 10Hnowlan: start using imposm as OSM sync tool (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:43:29] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still
[17:43:30] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still
[17:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:14] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:47:21] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2264.codfw.wmnet'] `  an...
[17:48:06] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:50:11] <wikibugs>	 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10calbon) Just to second what @fdans said, the data detective work was great and this was such a fun ticket to watch.
[17:51:27] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1383.eqiad.wmnet'] `  an...
[17:51:40] <wikibugs>	 10SRE, 10Data-Persistence-Backup: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) Something that may or may not be related, but we will want to correct is that backup2002 is resolved on dns...
[17:55:57] <wikibugs>	 (03PS30) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:58:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott)
[18:00:04] <jouncebot>	 chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1800).
[18:01:15] <wikibugs>	 (03PS31) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[18:02:20] <wikibugs>	 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) >>! In T273741#6815874, @Majavah wrote: > Is the effect that the block will have in the app known?  No, hence we trie...
[18:02:47] <hashar>	 grbmbl
[18:02:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[18:14:37] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:20] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove old OpenStack Rocky files/templates/manifests [puppet] - 10https://gerrit.wikimedia.org/r/663027
[18:20:05] <wikibugs>	 (03PS2) 10DLynch: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) (owner: 10Bartosz Dziewoński)
[18:20:57] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
[18:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:15] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1300.eqiad.wmnet'] `  an...
[18:21:33] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:10] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:28] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:57] <elukey>	 this is probably related to jupyter, we are upgrading
[18:23:39] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2220.codfw.wmnet'] `  an...
[18:24:01] <Majavah>	 Urbanecm: did you figure out the parameter issue?
[18:24:28] <Urbanecm>	 Majavah: yes, I was stupid, I opened my local master version, and the server used wmf.29 obviously, and the number of params was different
[18:25:38] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:24] <icinga-wm>	 RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:02] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:35] <hasharAway>	 apergos: Majavah: we are leaving the state as is for now
[18:27:51] <apergos>	 ok (I'm also pretty done for the day)
[18:27:52] <hasharAway>	 guess we will want to backport the cache fix again
[18:27:59] <hasharAway>	 or just https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662985
[18:28:03] <hasharAway>	 (which is for master)
[18:28:16] <apergos>	 there is the additional fix that needs to go in on top of that one, after testing and code review
[18:28:55] <apergos>	 yeah I think it's both those combined, anyways I'll be back to looking at things tomorrow 
[18:29:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:21] <hasharAway>	 so I guess yes, two patches have to be carried
[18:30:41] <hasharAway>	 will check with this week's train conductors in ~ 1H30 and that will follow from there
[18:30:45] <hasharAway>	 for now it is dinner time
[18:32:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:10] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:12] <icinga-wm>	 PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie
[18:34:31] <elukey>	 ah snap downtime
[18:34:57] <elukey>	 we are upgrading :)
[18:37:00] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2220.codfw.wmnet
[18:37:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:14] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1300.eqiad.wmnet
[18:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:24] <icinga-wm>	 RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie
[18:37:24] <icinga-wm>	 RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[18:37:40] <ryankemper>	 !log T267927 [WDQS Data Reload] Clearing old wikidata journal file to free disk space before beginning data reload:`sudo systemctl status wdqs-blazegraph && sudo systemctl stop wdqs-blazegraph && sudo rm -fv /srv/wdqs/wikidata.jnl && sudo systemctl start wdqs-blazegraph` on `wdqs100[9,10]`
[18:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:46] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[18:39:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:39:07] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:16] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:25] <ryankemper>	 !log T267927 [WDQS Data Reload] `sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` on `ryankemper@cumin1001` tmux session `wdqs_data_reload_1009`
[18:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:57] <ryankemper>	 !log T267927 [WDQS Data Reload] `sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` on `ryankemper@cumin1001` tmux session `wdqs_data_reload_1009`
[18:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:09] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:53] <ryankemper>	 !log T267927 [WDQS Data Reload] Small typo in previous SAL log message, see subsequent SAL line for correction:
[18:41:54] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Gilles) a:20after4
[18:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:04] <ryankemper>	 !log T267927 [WDQS Data Reload] `sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` on `ryankemper@cumin1001` tmux session `wdqs_data_reload_1010`
[18:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:01] <wikibugs>	 SRE, Data-Persistence-Backup, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) **backup2002 -> backup1002** (please note this was while large backups were running in the backg...
[18:45:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:25] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:49] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (mmodell) Indeed I think it's caching static files correctly.
[18:47:45] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (epriestley) > Eventually Phabricator ran out of local sockets (30k limit):  I'm not familiar with the particulars of your infrastructure, but if this was on th...
[18:48:30] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Gilles) a:20after4→mmodell
[18:52:18] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Majavah) Hmm. I'm looking at the Firefox developer console, and it looks like this: {F34098763}  Note that styles and scripts are marked as "cached" instead of...
[18:57:00] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[18:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1900)
[19:01:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2264.codfw.wmnet
[19:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[19:01:34] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1383.eqiad.wmnet
[19:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[19:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - elukey@cumin1001
[19:04:15] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - elukey@cumin1001
[19:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:34] <elukey>	 ufff
[19:08:23] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[19:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[19:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:10] <wikibugs>	 SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (epriestley) I suspect you're seeing that because when you reload a page by issuing a "Reload" command in your browser, most (all?) modern browsers interpret th...
[19:13:42] <wikibugs>	 (PS9) Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211)
[19:15:03] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2220.codfw.wmnet
[19:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:25] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1300.eqiad.wmnet
[19:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:51] <wikibugs>	 (CR) Ryan Kemper: relforge: service impl of relforge100[3,4] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[19:18:53] <wikibugs>	 (CR) Ryan Kemper: [C: +2] relforge: service impl of relforge100[3,4] [puppet] - https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[19:19:53] <ryankemper>	 !log T262211 Attempting to bring `relforge100[3,4]` into service; merging https://gerrit.wikimedia.org/r/661229
[19:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:58] <stashbot>	 T262211: Service implementation for relforge100[34] - https://phabricator.wikimedia.org/T262211
[19:21:14] <ryankemper>	 !log T262211 `sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'` on `ryankemper@cumin1001`
[19:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:01] <wikibugs>	 SRE, netops: tcp handshake failure between pfw3-eqiad and frlog1001:6514 - https://phabricator.wikimedia.org/T263833 (ayounsi)
[19:23:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[19:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:07] <wikibugs>	 (CR) 20after4: [C: +2] Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) (owner: Hashar)
[19:26:53] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[19:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:01] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1383.eqiad.wmnet
[19:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:04] <wikibugs>	 (Merged) jenkins-bot: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) (owner: Hashar)
[19:30:53] <wikibugs>	 (PS1) Razzi: sre.druid.roll-restart-workers: properly pass commands list [cookbooks] - https://gerrit.wikimedia.org/r/663033 (https://phabricator.wikimedia.org/T269925)
[19:31:31] <wikibugs>	 (PS1) Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/663034 (https://phabricator.wikimedia.org/T274208)
[19:33:15] <wikibugs>	 (CR) Elukey: [C: +1] sre.druid.roll-restart-workers: properly pass commands list [cookbooks] - https://gerrit.wikimedia.org/r/663033 (https://phabricator.wikimedia.org/T269925) (owner: Razzi)
[19:33:50] <wikibugs>	 (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
[19:33:53] <wikibugs>	 (CR) Razzi: [C: +2] sre.druid.roll-restart-workers: properly pass commands list [cookbooks] - https://gerrit.wikimedia.org/r/663033 (https://phabricator.wikimedia.org/T269925) (owner: Razzi)
[19:33:59] <wikibugs>	 (CR) Andrew Bogott: [C: +2] profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/663034 (https://phabricator.wikimedia.org/T274208) (owner: Andrew Bogott)
[19:35:24] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
[19:35:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2264.codfw.wmnet
[19:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:18] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:40:29] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:40:48] <wikibugs>	 (PS1) Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208)
[19:41:01] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:41:26] <wikibugs>	 (CR) Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
[19:41:54] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:42:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208) (owner: Andrew Bogott)
[19:44:59] <wikibugs>	 (PS2) Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208)
[19:46:33] <wikibugs>	 (CR) Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan)
[19:47:09] <wikibugs>	 (CR) Andrew Bogott: [C: +2] profile::labs::downloadserver: remove lvm class [puppet] - https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208) (owner: Andrew Bogott)
[19:51:17] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:53:09] <icinga-wm>	 PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:13] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2263.codfw.wmnet with reason: REIMAGE
[19:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:53] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1382.eqiad.wmnet with reason: REIMAGE
[19:57:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2263.codfw.wmnet with reason: REIMAGE
[19:58:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1385.eqiad.wmnet with reason: REIMAGE
[19:58:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:04] <jouncebot>	 twentyafterfour and hashar: May I have your attention please! Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T2000)
[20:00:25] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1382.eqiad.wmnet with reason: REIMAGE
[20:00:27] <twentyafterfour>	 !log prepping 1.36.0-wmf.30 
[20:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:09] <mutante>	 reimages are ongoing but ideally you don't notice because they get removed from scap dsh groups in time and I scap pull before repooling
[20:02:28] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1385.eqiad.wmnet with reason: REIMAGE
[20:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:27] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:05:33] <icinga-wm>	 PROBLEM - Hive Metastore on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[20:05:44] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27938/" [puppet] - https://gerrit.wikimedia.org/r/662033 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:05:54] <wikibugs>	 (PS2) Dzahn: profile::rsyslog::udp_json_logback_compat: hiera -> lookup [puppet] - https://gerrit.wikimedia.org/r/662033 (https://phabricator.wikimedia.org/T209953)
[20:06:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[20:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:26] <elukey>	 the hive errors are me sorry
[20:07:21] <mutante>	 thanks, ack
[20:07:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1299.eqiad.wmnet with reason: REIMAGE
[20:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[20:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:43] <wikibugs>	 (PS1) SBassett: Link to non-wiki privacy policy [puppet] - https://gerrit.wikimedia.org/r/663040 (https://phabricator.wikimedia.org/T207244)
[20:09:35] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1299.eqiad.wmnet with reason: REIMAGE
[20:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[20:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:57] <wikibugs>	 (CR) Jforrester: "recheck" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/662966 (owner: Kormat)
[20:11:19] <logmsgbot>	 !log otto@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - otto@cumin1001
[20:11:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[20:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:07] <logmsgbot>	 !log otto@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - otto@cumin1001
[20:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[20:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:36] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001
[20:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:35] <wikibugs>	 (CR) Dzahn: "noop on wdqs1003, restbase1022, thumbor2004,.." [puppet] - https://gerrit.wikimedia.org/r/662033 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:14:57] <wikibugs>	 (PS2) Dzahn: wmcs::monitoring: replace hiera inside hiera with lookup [puppet] - https://gerrit.wikimedia.org/r/662026 (https://phabricator.wikimedia.org/T209953)
[20:19:40] <wikibugs>	 (CR) Jfishback: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/663040 (https://phabricator.wikimedia.org/T207244) (owner: SBassett)
[20:20:38] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2263.codfw.wmnet'] `  an...
[20:21:07] <wikibugs>	 (CR) 20after4: [C: +2] Branch commit for wmf/1.36.0-wmf.30 [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662803 (owner: TrainBranchBot)
[20:21:26] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
[20:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:50] <icinga-wm>	 RECOVERY - Hive Metastore on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[20:23:48] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:05] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1385.eqiad.wmnet'] `  an...
[20:24:06] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:32] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1382.eqiad.wmnet'] `  an...
[20:24:44] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:12] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2263.codfw.wmnet
[20:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:29] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1385.eqiad.wmnet
[20:33:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:40] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1382.eqiad.wmnet
[20:33:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1385.eqiad.wmnet
[20:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1382.eqiad.wmnet
[20:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:19] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2263.codfw.wmnet
[20:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.30 [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662803 (owner: TrainBranchBot)
[20:45:04] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:45:29] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:46:01] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:49:52] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:52:53] <wikibugs>	 (CR) Dzahn: "compiler shows noop on cloudmetrics1002 - https://puppet-compiler.wmflabs.org/compiler1002/27939/" [puppet] - https://gerrit.wikimedia.org/r/662026 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:54:58] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1299.eqiad.wmnet'] `  an...
[20:55:55] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27940/grafana2001.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:56:28] <wikibugs>	 (PS2) Dzahn: grafana: replace hiera inside hiera with lookup [puppet] - https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953)
[20:56:47] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1299.eqiad.wmnet
[20:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:31] <wikibugs>	 (CR) Dzahn: "noop on grafana1002" [puppet] - https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[20:58:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1299.eqiad.wmnet
[20:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:29] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:01:25] <wikibugs>	 (PS2) Dzahn: netmon: replace hiera within hiera with lookup [puppet] - https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953)
[21:01:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1384.eqiad.wmnet with reason: REIMAGE
[21:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:23] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1381.eqiad.wmnet with reason: REIMAGE
[21:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2260.codfw.wmnet with reason: REIMAGE
[21:02:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:27] <wikibugs>	 SRE, ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (Papaul) p:Triage→Medium
[21:04:04] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1384.eqiad.wmnet with reason: REIMAGE
[21:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:03] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1381.eqiad.wmnet with reason: REIMAGE
[21:06:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:53] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2260.codfw.wmnet with reason: REIMAGE
[21:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:28] <wikibugs>	 (CR) Dzahn: [C: +2] Link to non-wiki privacy policy [puppet] - https://gerrit.wikimedia.org/r/663040 (https://phabricator.wikimedia.org/T207244) (owner: SBassett)
[21:10:02] <elukey>	 !log Analytics Hadoop cluster upgrade completed
[21:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27941/" [puppet] - 10https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:12:51] <wikibugs>	 (03CR) 1020after4: "This change is ready for review." [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4)
[21:13:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4)
[21:14:10] <wikibugs>	 (03CR) 10Dzahn: "noop on netmon1002, netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:14:39] <wikibugs>	 (03Abandoned) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4)
[21:15:04] <wikibugs>	 (03PS2) 10Dzahn: netbox: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953)
[21:17:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27943/" [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:18:52] <wikibugs>	 (03CR) 10Dzahn: "noop on netbox1001,netbox2001" [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:26:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:00] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/27944/  but I am kind of expecting some alerts after this gets merged.. a new timer on EV" [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[21:27:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1298.eqiad.wmnet with reason: REIMAGE
[21:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:15] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1384.eqiad.wmnet'] `  an...
[21:29:21] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2260.codfw.wmnet'] `  an...
[21:29:33] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1298.eqiad.wmnet with reason: REIMAGE
[21:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:49] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1381.eqiad.wmnet'] `  an...
[21:30:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1381.eqiad.wmnet
[21:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:33] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1384.eqiad.wmnet
[21:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:54] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2260.codfw.wmnet
[21:30:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:29] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.154e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:35:40] <wikibugs>	 (03CR) 10Dzahn: "not just mwdebug but also other canaries https://puppet-compiler.wmflabs.org/compiler1001/27945/mw2271.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn)
[21:36:57] <wikibugs>	 (03PS2) 10Legoktm: docker_registry_ha: Properly override nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/662806
[21:36:59] <wikibugs>	 (03PS2) 10Legoktm: [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521)
[21:38:23] <wikibugs>	 10SRE, 10dev-images, 10docker-pkg, 10Release-Engineering-Team (Local Dev): docker-pkg: "certificate verify failed: unable to get local issuer certificate" for docker-registry.discovery.wmnet when publishing dev-images from contint2001 - https://phabricator.wikimedia.org/T274306 (10brennen)
[21:39:24] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1384.eqiad.wmnet
[21:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:31] <wikibugs>	 10SRE, 10dev-images, 10docker-pkg, 10Release-Engineering-Team (Local Dev), 10User-brennen: docker-pkg: "certificate verify failed: unable to get local issuer certificate" for docker-registry.discovery.wmnet when publishing dev-images from contint2001 - https://phabricator.wikimedia.org/T274306 (10brennen...
[21:40:15] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:40:40] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1381.eqiad.wmnet
[21:40:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:18] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:42:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2260.codfw.wmnet
[21:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:31] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:45:44] <wikibugs>	 (03PS3) 10Legoktm: [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521)
[21:48:31] <wikibugs>	 (03PS2) 10Dzahn: gerrit: replace certbot cron for cloud with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673)
[21:56:56] <wikibugs>	 (03CR) 10Dzahn: "cloud-only, not affecting prod: https://puppet-compiler.wmflabs.org/compiler1001/27949/" [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[21:57:08] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1373.eqiad.wmnet with reason: REIMAGE
[21:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1380.eqiad.wmnet with reason: REIMAGE
[21:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1373.eqiad.wmnet with reason: REIMAGE
[21:59:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:04] <jouncebot>	 Legoktm and DannyS712: It is that lovely time of the day again! You are hereby commanded to deploy GlobalWatchlist deployment to production. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T2200).
[22:00:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2259.codfw.wmnet with reason: REIMAGE
[22:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:56] <DannyS712>	 legoktm ready when you are
[22:01:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:10] <wikibugs>	 (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[22:01:11] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1380.eqiad.wmnet with reason: REIMAGE
[22:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: replace certbot cron for cloud with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:02:49] <legoktm>	 give me a minute
[22:03:16] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2259.codfw.wmnet with reason: REIMAGE
[22:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:32] <legoktm>	 DannyS712: did all the patches make it in the train?
[22:04:00] <DannyS712>	 {{checking}}
[22:04:33] <wikibugs>	 (03CR) 10Volans: "Code looks sane and in line with the current style, that as discussed we both agree should be refactored, but that's out of scope for this" (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov)
[22:04:52] <DannyS712>	 everything that has merged already was merged before the branch cut, per https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GlobalWatchlist, but I'd like to get https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalWatchlist/+/661103 in before the deployment
[22:05:51] <DannyS712>	 except something happened with the train? testwiki is only on wmf.29, not .30
[22:07:21] <wikibugs>	 (03PS2) 10Dzahn: gerrit: remove code that absented cron [puppet] - 10https://gerrit.wikimedia.org/r/662036 (https://phabricator.wikimedia.org/T273673)
[22:07:27] <legoktm>	 I guess hashar only went to .29 today?
[22:07:49] <icinga-wm>	 RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:07:58] <legoktm>	 https://lists.wikimedia.org/pipermail/wikitech-l/2021-February/094254.html
[22:08:13] <DannyS712>	 https://sal.toolforge.org/log/ntJhiHcBhxWNv8gIhjka has twentyafterfour "prepping 1.36.0-wmf.30"
[22:08:13] <DannyS712>	 I propose we deploy to testwiki with .29, but avoid enabling on metawiki, until .30 is live there
[22:08:34] <legoktm>	 hmm
[22:08:36] <DannyS712>	 that one remaining patch can be merged on master and then backported to .30, but we don't need to wait for that now
[22:08:53] <twentyafterfour>	 I'm working on backporting patches now
[22:09:12] <legoktm>	 twentyafterfour: when are you planning to rollout .30 to group0?
[22:09:39] <twentyafterfour>	 legoktm: as soon as I can sort out the mess that I'm in with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662985
[22:09:50] <legoktm>	 ack
[22:10:17] <legoktm>	 DannyS712: so let's enable it on testwiki, but remove the logging portion until that patch is deployed with wmf.30
[22:11:23] <icinga-wm>	 PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:12:03] <DannyS712>	 why? The logging portion specifies `info`, and the patch that cleaned up the logging moved stuff that was previously `debug` to `info` - if we enable the logging at `info`, without that logging patch the only consequence is that (in theory, if I understand correctly) *nothing* will get logged
[22:12:28] <legoktm>	 oh, you're right
[22:12:31] <legoktm>	 I confused myself
[22:12:32] <legoktm>	 okay
[22:12:38] <legoktm>	 let's go :)
[22:13:35] <wikibugs>	 (03PS7) 10Legoktm: Enable GlobalWatchlist extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[22:13:44] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Enable GlobalWatchlist extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[22:14:43] <wikibugs>	 (03Merged) 10jenkins-bot: Enable GlobalWatchlist extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712)
[22:14:45] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1298.eqiad.wmnet'] `  an...
[22:16:25] <legoktm>	 DannyS712: try on mwdebug1002
[22:16:32] <DannyS712>	 trying
[22:17:03] <legoktm>	 https://test.wikipedia.org/wiki/Special:GlobalWatchlist
[22:17:13] <legoktm>	 I see it on Special:Version \o/
[22:17:33] <DannyS712>	 it's working
[22:18:06] <wikibugs>	 (03CR) 10Dzahn: "things looking good on gerrit-prod-1001.devtools in cloud.  just noticed there is already a timer doing the same thing that comes from the" [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:18:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: remove code that absented cron [puppet] - 10https://gerrit.wikimedia.org/r/662036 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:18:38] <DannyS712>	 might be a bug with the sidebar on Special:Watchlist (it should have a link to the special page) but that also looks to be broken on the beta cluster... should be good to sync
[22:18:47] <wikibugs>	 (03Restored) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4)
[22:19:25] <wikibugs>	 (03PS2) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391)
[22:20:12] <legoktm>	 syncing
[22:23:02] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GlobalWatchlist extension on testwiki (T260862) (duration: 02m 51s)
[22:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:06] <stashbot>	 T260862: Deploy GlobalWatchlist extension to production (Meta only) - https://phabricator.wikimedia.org/T260862
[22:23:34] <wikibugs>	 10SRE, 10dev-images, 10docker-pkg, 10Release-Engineering-Team (Local Dev), 10User-brennen: docker-pkg: "certificate verify failed: unable to get local issuer certificate" for docker-registry.discovery.wmnet when publishing dev-images from contint2001 - https://phabricator.wikimedia.org/T274306 (10brennen...
[22:23:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1298.eqiad.wmnet
[22:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:45] <DannyS712>	 legoktm confirmed to be working :)
[22:23:55] <DannyS712>	 will figure out the sidebar link in a minute
[22:24:02] <legoktm>	 DannyS712: :)) I think you should send an email to wikitech-l asking for people to test / provide feedback
[22:24:08] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1373.eqiad.wmnet'] `  an...
[22:24:36] <DannyS712>	 I will once we're done - if I can figure out the issue with the sidebar link first I'd like to
[22:24:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:03] <legoktm>	 done? was there anything else to deploy?
[22:25:30] <DannyS712>	 I mean once I'm done figuring it out - there shouldn't be anything else to deploy now
[22:25:32] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2259.codfw.wmnet'] `  an...
[22:25:58] <DannyS712>	 thanks for the help
[22:26:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet
[22:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:15] <legoktm>	 yw!
[22:26:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:28:15] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1373.eqiad.wmnet
[22:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1373.eqiad.wmnet
[22:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:03] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:31:01] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2259.codfw.wmnet
[22:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:41] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2259.codfw.wmnet
[22:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:49] <wikibugs>	 (03PS13) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)
[22:35:29] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:37:07] <wikibugs>	 (03PS2) 10Volans: mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui)
[22:40:30] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "Failed to execute generator /usr/bin/systemd-analyze: Execution of '/usr/bin/systemd-analyze calendar  *-*-1 0:0:00' returned 1: Failed to" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:44:34] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "/usr/bin/systemd-analyze calendar '*-*-1 0:0:00'" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[22:46:58] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1372.eqiad.wmnet with reason: REIMAGE
[22:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:00] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1372.eqiad.wmnet with reason: REIMAGE
[22:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:52] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1296.eqiad.wmnet with reason: REIMAGE
[22:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1296.eqiad.wmnet with reason: REIMAGE
[22:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:01] <wikibugs>	 (03Abandoned) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4)
[23:01:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:07] <wikibugs>	 (03PS2) 10Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673)
[23:03:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2250.codfw.wmnet with reason: REIMAGE
[23:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2250.codfw.wmnet with reason: REIMAGE
[23:05:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:23] <wikibugs>	 (03PS1) 10Dzahn: wmflib: add data type for 'day of the week' in systemd timers/caledar [puppet] - 10https://gerrit.wikimedia.org/r/663051
[23:12:07] <wikibugs>	 (03PS2) 10Dzahn: wmflib: add data type for 'day of the week' in systemd timers/calendar [puppet] - 10https://gerrit.wikimedia.org/r/663051
[23:13:20] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1372.eqiad.wmnet'] `  an...
[23:16:59] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/27951/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[23:17:10] <wikibugs>	 (03PS1) 10Ryan Kemper: relforge: New hosts are relforge100[3,4] [homer/public] - 10https://gerrit.wikimedia.org/r/663054 (https://phabricator.wikimedia.org/T274314)
[23:18:02] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1372.eqiad.wmnet
[23:18:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:14] <wikibugs>	 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Ladsgroup) Thanks. I don't know much about puppet so I'm probably wrong but it's just it's mentioned in list of core types: https://puppet.com/do...
[23:19:55] <wikibugs>	 (03CR) 10Kosta Harlan: "Tested this out, and it's working! Still WIP until we can sort out the credential / env variables, see inline comment." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[23:20:01] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "needs some code to set it to "*" by default if weekday or monthday is not set, will follow-up later" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn)
[23:21:54] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1380.eqiad.wmnet'] `  Of...
[23:23:03] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1372.eqiad.wmnet
[23:23:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:02] <mutante>	 !log mw1380 - powercycling after it did not come back from normal reboot during reimaging
[23:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1380.eqiad.wmnet
[23:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:11] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1380.eqiad.wmnet
[23:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:33] <wikibugs>	 (03PS1) 1020after4: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663056
[23:35:35] <wikibugs>	 (03CR) 1020after4: [C: 03+2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663056 (owner: 1020after4)
[23:36:29] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663056 (owner: 1020after4)
[23:39:59] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1296.eqiad.wmnet'] `  an...
[23:40:30] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1296.eqiad.wmnet
[23:40:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:31] <wikibugs>	 (03PS2) 10Cwhite: profile: remove deprecated syslog input [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032)
[23:42:40] <wikibugs>	 (03PS1) 10Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057
[23:43:20] <wikibugs>	 (03PS1) 10Legoktm: Add hiera for docker_registry_ha I76a6fc9d21380 [labs/private] - 10https://gerrit.wikimedia.org/r/663058
[23:44:53] <wikibugs>	 (03PS2) 10Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057
[23:44:58] <wikibugs>	 (03PS3) 10Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057
[23:45:41] <Kemayo>	 Did everything get rolled back to .27 again?
[23:45:58] <thcipriani>	 yes
[23:46:00] <logmsgbot>	 !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27
[23:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1382 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:46:16] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 3543 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:46:30] <mutante>	 looking at 1382
[23:46:50] <mutante>	 scap pulling
[23:47:02] <Kemayo>	 Is the intention to roll things forward again shortly, or are we stuck for a while again? (I ask entirely because it affects whether I go ahead with the config patch I have scheduled for the current backport window.)
[23:47:10] <logmsgbot>	 !log twentyafterfour@deploy1001 Started scap: (no justification provided)
[23:47:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:29] <twentyafterfour>	 Kemayo: stuck until .30
[23:47:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw2220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:47:36] <wikibugs>	 (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite)
[23:47:36] <twentyafterfour>	 mutante: I'm running a sync-world
[23:47:45] <twentyafterfour>	 !log running scap sync-world
[23:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:06] <Kemayo>	 twentyafterfour: Okay, got it.
[23:48:11] <thcipriani>	 Kemayo: I'll update the task here in a second, but we're staying on wmf.27 for now, will move forward once we get a backport for wmf.30 figured out. wmf.29 won't go out.
[23:48:17] <twentyafterfour>	 .30 tomorrow, hopefully
[23:48:25] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1296.eqiad.wmnet
[23:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:48:51] <mutante>	 twentyafterfour: thanks and that's good
[23:49:07] <mutante>	 1296 repooled as well.. and pulled
[23:49:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1385 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:49:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1383 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:49:55] <thcipriani>	 ugh, this is: https://phabricator.wikimedia.org/T273334
[23:49:56] <icinga-wm>	 RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:50:44] <mutante>	 !log mw1383,mw1385 - scap pull, php
[23:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:37] <icinga-wm>	 RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:51:43] <icinga-wm>	 RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:52:29] <mutante>	 twentyafterfour: should be good now
[23:52:39] <mutante>	 rescheduled icinga after pull etc
[23:52:51] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2250.codfw.wmnet'] `  an...
[23:52:55] <mutante>	 (pulling also runs the restart check)
[23:52:59] <legoktm>	 DannyS712: is GlobalWatchlist OK on wmf.27 or should we disable it?
[23:53:38] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2250.codfw.wmnet
[23:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:25] <logmsgbot>	 !log twentyafterfour@deploy1001 Finished scap: (no justification provided) (duration: 08m 43s)
[23:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:40] <ryankemper>	 !log Depooled `wdqs1005` - it's catching up on hours of lag
[23:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:03] <wikibugs>	 (03Restored) 10Thcipriani: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4)
[23:56:09] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet
[23:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:07] <wikibugs>	 (03PS1) 1020after4: testwikis wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663060
[23:57:09] <wikibugs>	 (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663060 (owner: 1020after4)
[23:57:55] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663060 (owner: 1020after4)
[23:58:09] <logmsgbot>	 !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.30
[23:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:15] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops