[00:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:00:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1302.eqiad.wmnet with reason: REIMAGE [00:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:44] (03CR) 10Dzahn: [C: 04-1] wikilabels: replace cron with systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [00:02:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1301.eqiad.wmnet with reason: REIMAGE [00:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:00] (03CR) 10Dzahn: [C: 03+1] "looks good to me, afaict. will let Bryan review though and please test a restart after merging this" [puppet] - 10https://gerrit.wikimedia.org/r/662764 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:09:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1387.eqiad.wmnet'] ` an... 
[00:09:55] (03CR) 10Dzahn: [C: 04-1] "deploy2001.codfw.wmnet: Evaluation Error: Resource type not found: Stdlib::Path (Stdlib::Unixpath)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [00:10:10] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: 6.514e+06 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [00:10:50] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1008 is CRITICAL: 1.026e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [00:11:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1386.eqiad.wmnet'] ` an... 
[00:14:48] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1009 is CRITICAL: 1.555e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [00:15:30] (03PS4) 10Dzahn: kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [00:15:32] (03CR) 10Dzahn: [C: 04-1] kubernetes::deployment_server: add yaml to configure MediaWiki sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [00:19:08] (03CR) 10Dzahn: [C: 04-1] "PS4 fixed an issue but there is also "Failed to parse inline template: undefined method `content'"" [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [00:22:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1387.eqiad.wmnet [00:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:20] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1386.eqiad.wmnet [00:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1386.eqiad.wmnet [00:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:16] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) Another suggestion coming from twitter is https://play.google.com/store/apps/details?id=com.app.rcn, which anyways doesn't seem popular eno... 
[00:28:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1387.eqiad.wmnet [00:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:38] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Ladsgroup) I don't have much knowledge about India's internet infrastructure but from experience of Iran and blocking apps/websites. They show y... [00:34:26] (03PS1) 10Bstorm: wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) [00:35:15] (03CR) 10BryanDavis: [C: 03+1] "The diffs look reasonable to me. I can't remember off the top of my head if there have been api changes in irc.bot that would require othe" [puppet] - 10https://gerrit.wikimedia.org/r/662764 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [00:36:36] (03CR) 10Bstorm: "I have zero idea if the logging is even working right now. Currently, the daemons on each server both point at the same log file. I figure" [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [00:41:56] ^ re kafka max lag: a slow migration extended past the downtime window but is going smoothly [00:45:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1302.eqiad.wmnet'] ` an... 
[00:46:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1301.eqiad.wmnet'] ` an... [00:48:22] (03PS1) 10Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) [00:48:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1301.eqiad.wmnet [00:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1302.eqiad.wmnet [00:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:51] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: allow rsyncing home dirs from any mwdebug* to mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [00:59:26] (03PS2) 10Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) [00:59:42] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Preinheimer) Going to the TikTok website from India results in the regular TikTok page loading, with a banner from TikTok saying that the servic... 
[01:01:42] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1008 is OK: (C)5e+06 ge (W)1e+06 ge 9.3e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [01:01:48] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1001 is OK: (C)5e+06 ge (W)1e+06 ge 8.736e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [01:02:00] (03PS3) 10Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) [01:06:49] (03PS1) 10Andrew Bogott: Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) [01:07:38] (03CR) 10jerkins-bot: [V: 04-1] Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [01:09:00] (03PS2) 10Andrew Bogott: Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) [01:10:56] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1009 is OK: (C)5e+06 ge (W)1e+06 ge 8.019e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [01:12:51] (03PS3) 10Andrew Bogott: Keystone: a new (but still terrible) approach to making 
projectid==projectname [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) [01:13:43] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Dzahn) https://newshimalaya.com/2021/02/09/%E2%9A%93-t273741-investigate-unusual-media-traffic-pattern-for-asternovi-belgii-flower-1mb-jpg-on-co... [01:13:50] (03PS1) 10Tim Starling: Caching fixes [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662677 (https://phabricator.wikimedia.org/T264391) [01:14:41] (03CR) 10Tim Starling: [C: 03+2] Caching fixes [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662677 (https://phabricator.wikimedia.org/T264391) (owner: 10Tim Starling) [01:15:41] dpifke: ping on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/597654 [01:20:40] (03Merged) 10jenkins-bot: Caching fixes [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662677 (https://phabricator.wikimedia.org/T264391) (owner: 10Tim Starling) [01:35:16] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1007 is OK: (C)5e+06 ge (W)1e+06 ge 7.903e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007 [01:36:00] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10varenc) >>! In T273741#6813823, @Preinheimer wrote: > Going to the TikTok website from India results in the regular TikTok page loading, with a... 
[01:46:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1301.eqiad.wmnet [01:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1302.eqiad.wmnet [01:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:55] !log tstarling@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/FeaturedFeeds: probable fix for UBN T273242 (duration: 01m 06s) [01:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:59] T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242 [02:07:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.30 [core] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662803 [02:35:56] (03PS1) 10Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/662805 (https://phabricator.wikimedia.org/T274208) [02:37:41] (03CR) 10Andrew Bogott: [C: 03+2] profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/662805 (https://phabricator.wikimedia.org/T274208) (owner: 10Andrew Bogott) [02:57:02] (03PS1) 10Legoktm: docker_registry_ha: Properly override nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/662806 [02:57:04] (03PS1) 10Legoktm: [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) [02:57:28] (03CR) 10jerkins-bot: [V: 04-1] [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [03:17:10] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - 
https://phabricator.wikimedia.org/T273741 (10tomglynch) Hi all, I've been doing a bit of research into possible apps that could be causing this and found two potential culprits that I am cu... [03:29:47] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10ssingh) Thank you everyone for the comments and suggestions. I just wanted to share that we have identified the app and will update this task to... [04:33:17] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10mmodell) >>! In T273741#6813839, @Dzahn wrote: > ^ wut? I tried to search for links to this image and found... this Phabricator ticket content... [04:34:07] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10mmodell) Also hello hacker news! [04:56:16] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10TheOv3rminD) >>! In T273741#6813995, @mmodell wrote: > Also, hello hacker news! https://news.ycombinator.com/item?id=26072025 Hello From us Hac... 
[05:02:09] !log krinkle@deploy1001 Started deploy [integration/docroot@fdfb265]: I271e6054880, T273247 [05:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:14] T273247: Publish wikimedia/minify as its own repo and package - https://phabricator.wikimedia.org/T273247 [05:02:16] !log krinkle@deploy1001 Finished deploy [integration/docroot@fdfb265]: I271e6054880, T273247 (duration: 00m 06s) [05:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:23] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) I suspect this might have broken deplo... [06:03:13] (03PS1) 10Marostegui: Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/662680 [06:04:46] (03CR) 10Marostegui: [C: 03+2] Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/662680 (owner: 10Marostegui) [06:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14233 and previous config saved to /var/cache/conftool/dbconfig/20210209-060520-root.json [06:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:36] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Marostegui) I have started to repool this host back. 
[06:18:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 T258361', diff saved to https://phabricator.wikimedia.org/P14234 and previous config saved to /var/cache/conftool/dbconfig/20210209-061822-marostegui.json [06:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:28] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:20:24] !log Stop mysql on s2 and s7 on db1090 to clone db1170 T258361 [06:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14236 and previous config saved to /var/cache/conftool/dbconfig/20210209-062024-root.json [06:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:29] (03PS1) 10Marostegui: mariadb: Productionize db1170 [puppet] - 10https://gerrit.wikimedia.org/r/662818 (https://phabricator.wikimedia.org/T258361) [06:28:49] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1170 [puppet] - 10https://gerrit.wikimedia.org/r/662818 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [06:34:31] (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [06:35:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14237 and previous config saved to /var/cache/conftool/dbconfig/20210209-063527-root.json [06:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:20] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.63`. 
Pre-deploy tests passing on canary `wdqs1003` [06:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:21] (03PS1) 10Ayounsi: Depool esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/662822 (https://phabricator.wikimedia.org/T272342) [06:40:27] !log Pooled `wdqs1007` and depooled `wdqs1005` (`1005` is ~12 hours behind) [06:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:39] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@582b070]: 0.3.63 [06:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:28] !log [WDQS Deploy] Tests passing following deploy of `0.3.63` on canary `wdqs1003`; proceeding to rest of fleet [06:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:27] (03PS2) 10Ayounsi: Depool esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/662822 (https://phabricator.wikimedia.org/T272342) [06:43:06] (03CR) 10Ayounsi: [C: 03+2] Depool esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/662822 (https://phabricator.wikimedia.org/T272342) (owner: 10Ayounsi) [06:44:12] (03CR) 10Marostegui: "Just tested clouddb1014 getting a query there doesn't seem to appear on the log:" [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [06:44:15] !log depool esams for network maintenance - T272342 [06:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:25] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@582b070]: 0.3.63 (duration: 06m 46s) [06:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:27] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [06:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:48:31] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [06:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:37] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [06:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14238 and previous config saved to /var/cache/conftool/dbconfig/20210209-065031-root.json [06:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:10] (03PS7) 10Elukey: Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) [06:53:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:00:06] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:04:20] !log depool disable 2 uplinks on asw2-esams - T272342 [07:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14239 and previous config saved to /var/cache/conftool/dbconfig/20210209-070534-root.json [07:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:45] (03CR) 10Marostegui: "Some more testing:" [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [07:09:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:12:34] planned maintenance ^ [07:15:14] (03CR) 10Elukey: [C: 03+2] Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) (owner: 10Elukey) [07:15:22] RECOVERY - Check systemd state on clouddb1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:30] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 12.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:20:02] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:38] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1111 (re)pooling @ 100%: Slowly repooling db1111 after onsite maintenance', diff saved to https://phabricator.wikimedia.org/P14240 and previous config saved to /var/cache/conftool/dbconfig/20210209-072038-root.json [07:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:49] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [07:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:11] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Zardula) Hello from hacker news [07:30:42] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Majavah) [07:33:38] (03PS1) 10Marostegui: instances.yaml: Remove db1081 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/662828 (https://phabricator.wikimedia.org/T273040) [07:34:11] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1081 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/662828 (https://phabricator.wikimedia.org/T273040) (owner: 10Marostegui) [07:34:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1081 from dbctl T273040', diff saved to https://phabricator.wikimedia.org/P14241 and previous config saved to /var/cache/conftool/dbconfig/20210209-073455-marostegui.json [07:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:00] T273040: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 [07:35:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 87, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:49] 10SRE, 
10Wikimedia-Portals, 10Patch-For-Review, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) a:05SarthakKundra→03None [07:39:03] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) [07:41:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 83, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:12] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:43:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:52] 10SRE, 10netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (10ayounsi) Disabling the interface on the asw2 side (via homer) `lang=diff [edit interfaces interface-range disabled] + member et-4/0/51; member ge-4/0/27 { ... } [edit interfaces interface-range disabled]... [07:44:16] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [07:45:01] 10SRE, 10netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (10ayounsi) 05Open→03Resolved Remote hands did the recabling. Then pushed the cleanup via Homer as well. Everything is done. 
[07:45:46] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:49] routers recabling done, now moving on to upgrading asw2-esams OS [07:47:01] !log hashar@deploy1001 Started deploy [integration/docroot@672e79f]: build: Add /scap/log to gitignore [07:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:07] !log hashar@deploy1001 Finished deploy [integration/docroot@672e79f]: build: Add /scap/log to gitignore (duration: 00m 06s) [07:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:22] !log redirect ns2 to authdns1001 - T252631 [07:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:26] T252631: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 [07:55:51] (03PS5) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [08:02:55] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on 32 hosts with reason: switch upgrade [08:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 32 hosts with reason: switch upgrade [08:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:12] 10SRE, 10netops: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 (10ops-monitoring-bot) Icinga downtime set by ayounsi@cumin1001 for 1:30:00 32 host(s) and their services with reason: switch upgrade ` bast[3004-3005].wikimedia.org,cp[3050-3065].esams.wmnet,dns[3001-3002].wikimedia.org,g... 
[08:09:50] !log alright, brace yourself, esams switch stack is going to go down [08:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:36] (03CR) 10David Caro: [C: 03+2] style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [08:15:27] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 63, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:15:42] that's fine ^ [08:15:47] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:15:55] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:05] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:10] that too ^ [08:16:43] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:47] PROBLEM - VRRP status on cr3-esams is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [08:17:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 69, down: 2, dormant: 0, excluded: 0, 
unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:21:19] RECOVERY - VRRP status on cr3-esams is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [08:22:17] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:20] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Zbyszko) @jbond About Kryo - from what I see in Kryo src: > /** Registers the class using the lowest, next available integer ID and the {@link Kryo#getD... 
[08:22:45] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:22:57] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:23:07] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 426, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:23:08] (03Merged) 10jenkins-bot: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [08:23:45] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:28:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:29:55] phabricator can't reach its database, guess that is related to the above [08:30:10] hashar: which above? [08:30:11] !log rollback redirect ns2 to authdns1001 - T252631 [08:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:51] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [08:30:53] no idea [08:31:02] hashar: can you give us more info? [08:31:04] phabricator can't reach its db m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address. [08:31:04] <_joe_> XioNoX: is that esams? 
[08:31:17] _joe_: yeah, all my work is esams [08:31:18] <_joe_> same error here [08:31:32] so maybe that is not related :] [08:32:17] Attempt to connect to phabricatorphd@m3-master.eqiad.wmnet failed with error #2002: Cannot assign requested address. [08:32:23] mmmm [08:32:40] marostegui: hola hola, are you around? [08:32:50] Hello, can we either discuss here or on -sre? [08:32:51] (just to be sure) [08:33:08] Right now we are discussing this on -databases, -sre and here [08:33:09] ahh sorry my bad [08:33:11] So let's try to focus [08:33:18] database and proxy look up [08:33:23] I can reach the master from the proxy [08:33:23] ahh sorry my bad [08:33:39] <_joe_> elukey: go look at the phab machine, I suspect the problem is there [08:33:52] I just reloaded the proxies and they keep seeing the DB as up [08:34:01] phab is now back for me [08:34:02] phabricator should be on phab1001.eqiad.wmnet and "Cannot assign requested address." looks strange [08:34:05] <_joe_> marostegui: maybe too many connections? [08:34:07] yep I am on it, probably something weird [08:34:41] <_joe_> sorry I really gtg in 5 minutes :/ [08:34:42] _joe_: huge spike on connections right now [08:34:54] yeah https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=phab1001&var-datasource=thanos&var-cluster=misc&from=now-1h&to=now :-\ [08:35:27] socket usage jumped from 10k to 30k [08:36:02] XioNoX: the Phabricator/db issue does not seem related to the network stuff. 
Sorry :] [08:36:57] (phab issue being followed up in private #mediawiki_security [08:38:52] (03PS1) 10Ayounsi: Revert "Depool esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/662681 [08:43:52] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/662681 (owner: 10Ayounsi) [08:44:30] !log repool esams - T272342 [08:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:52] (03PS11) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [08:46:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [08:46:06] (03CR) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [08:46:47] (03PS12) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [08:47:53] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10Aklapper) Thank you. Please see https://phabricator.wikimedia.org/project/profile/1564/ for info that is usually requested. [08:52:07] 10SRE, 10netops: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 (10ayounsi) 05Open→03Resolved a:03ayounsi All done here. Note that we're hitting something similar to https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1363186 But it says "These log messages are ha... 
[08:55:10] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10VeronicaThamaini) Thank you @Aklapper. Is this the additional information needed. Username: Veronica Thamaini Shell access: I have a shell name. Should I share the name here... [08:57:11] (03CR) 10Gergő Tisza: [WIP] linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [08:59:53] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 36.89 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:00:08] (03CR) 10Volans: [C: 03+2] documentation: add a development page [software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [09:00:49] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Shizhao) Rename to a new filename? 
[09:01:07] (03CR) 10Alexandros Kosiaris: [WIP] linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [09:09:11] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10MoritzMuehlenhoff) [09:09:31] (03PS1) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 [09:09:39] (03Merged) 10jenkins-bot: documentation: add a development page [software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [09:11:25] (03PS2) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 [09:11:36] (03CR) 10jerkins-bot: [V: 04-1] P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 (owner: 10Jbond) [09:12:19] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10hashar) [09:12:52] (03CR) 10jerkins-bot: [V: 04-1] P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 (owner: 10Jbond) [09:13:07] (03PS3) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 [09:14:40] (03PS1) 10Elukey: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 [09:15:57] (03PS2) 10Jbond: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey) [09:16:15] (03CR) 10Jbond: [C: 03+1] "LGTM (minor fix to whitespace)" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 
10Elukey) [09:16:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27922/console" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey) [09:18:18] (03CR) 10Ayounsi: [C: 03+1] "LGTM, 1 nit, maybe also point to cacheproxy::performance for additional comments/explanations." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey) [09:19:02] (03PS3) 10Elukey: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 [09:19:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'll open a task to create a common performance profile to that we can align all the various hacks sprinkled over our Puppet t" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey) [09:20:17] (03PS4) 10Elukey: phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 [09:20:30] 10SRE, 10Traffic: Create a generic network proformance profile - https://phabricator.wikimedia.org/T274230 (10jbond) [09:20:38] (03CR) 10Elukey: "Fixed the comments to reflect the Phab specific use case :)" [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey) [09:22:03] (03CR) 10Elukey: [C: 03+2] phabricator: add network performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/662903 (owner: 10Elukey) [09:22:12] (03PS1) 10Volans: git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) [09:22:34] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836 [09:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:39] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [09:23:46] 10SRE, 10Traffic: Create a generic network proformance profile - https://phabricator.wikimedia.org/T274230 (10ayounsi) 
[09:23:57] 10SRE, 10Traffic: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10ayounsi) [09:24:02] (03CR) 10David Caro: [C: 03+2] git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [09:25:02] (03PS1) 10David Caro: tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 [09:26:16] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add defaults for profile::swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [09:26:41] (03CR) 10Kormat: "Big 👍 for the direction. I had a quick look, and don't see any obvious issues with the approach." [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:27:44] (03PS2) 10David Caro: tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 [09:27:46] (03PS11) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) [09:30:03] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 72.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:30:04] (03PS1) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904) [09:30:19] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [09:30:35] 10SRE, 10Beta-Cluster-Infrastructure, 10SRE-swift-storage, 10Patch-For-Review: Beta cluster Swift backend instances are missing profile::swift::storage::rsync_limit_memory_percent (puppet fails) - https://phabricator.wikimedia.org/T274092 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Issue should b... [09:31:15] 10SRE, 10Traffic: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10MoritzMuehlenhoff) Other classes in our Puppet tree which already apply some of the generic settings: * swift * profile::mediawiki::api * base::mysterious_sysctl * profile::mediawiki::common * profile... [09:31:27] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10MoritzMuehlenhoff) [09:31:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10jcrespo) a:05jcrespo→03Jclark-ctr @robh it got resolved. It will go in the regular production internal vlan (same as ms-fe hosts). In the future there is a chance it will... [09:32:35] (03CR) 10jerkins-bot: [V: 04-1] git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [09:32:54] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10jcrespo) a:05jcrespo→03Papaul Internal production vlan, same as ms-fe hosts. [09:33:25] (03CR) 10David Caro: "Just a question, not a blocker in any case." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [09:33:43] (03CR) 10Jcrespo: "Thanks, that was exactly what I needed, a quick thumbs up." 
[puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:34:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 10%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14242 and previous config saved to /var/cache/conftool/dbconfig/20210209-093400-root.json [09:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:16] (03CR) 10Jcrespo: "Removing you to spare you from testing spam." [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 10%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14243 and previous config saved to /var/cache/conftool/dbconfig/20210209-093429-root.json [09:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:43] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1170:3312 and db1170:3317 is now replicating [09:39:55] (03PS1) 10Muehlenhoff: Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 [09:40:09] (03PS2) 10Muehlenhoff: Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 [09:40:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:14] (03CR) 10jerkins-bot: [V: 04-1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff) [09:40:39] (03CR) 10jerkins-bot: [V: 04-1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff) [09:42:48] (03PS3) 10Muehlenhoff: Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 [09:43:20] (03CR) 10Volans: "question inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro) [09:43:56] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) I will open a new task for this issue and add you there. While this is not a blocker for backup generation, it would be for an emergency, and we should... [09:49:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 25%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14244 and previous config saved to /var/cache/conftool/dbconfig/20210209-094904-root.json [09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 25%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14245 and previous config saved to /var/cache/conftool/dbconfig/20210209-094932-root.json [09:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:04] (03CR) 10David Caro: tox: Fix runs when system setuptools is old (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro) [09:51:30] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 
(owner: 10David Caro) [09:59:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:59:52] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [10:00:10] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/662659 (owner: 10Hnowlan) [10:04:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 50%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14246 and previous config saved to /var/cache/conftool/dbconfig/20210209-100407-root.json [10:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 50%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14247 and previous config saved to /var/cache/conftool/dbconfig/20210209-100436-root.json [10:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:41] 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar) [10:10:32] (03CR) 10David Caro: [C: 03+2] tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro) [10:12:24] !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot [10:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:36] 10SRE, 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10fgiunchedi) 05Resolved→03Open `logstash1020.mgmt` is shown as down in icinga, reopening ` 
logstash1020.mgmt View Service Details For This Host DOWN 2021-02-09 10:08:30 0d 18h 33m 24s... [10:13:03] ryankemper: ^^^ restarting of the wdqs public and internal clusters in progress [10:13:04] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1019.eqiad.wmnet [10:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] (03PS1) 10Marostegui: instances: Add db1157 [puppet] - 10https://gerrit.wikimedia.org/r/662915 (https://phabricator.wikimedia.org/T258361) [10:14:03] (03CR) 10Marostegui: [C: 03+2] instances: Add db1157 [puppet] - 10https://gerrit.wikimedia.org/r/662915 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [10:14:18] (03PS3) 10Jcrespo: mariadb-backups: Remove old scheduled job disabling [puppet] - 10https://gerrit.wikimedia.org/r/644861 [10:14:20] (03PS1) 10Elukey: sre.hadoop.stop-cluster: avoid context managers to apply downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/662916 [10:14:58] 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar) CI still had some usage of `docker-registry.wikimedia.org/wikimedia-jessie` which got removed in July 2020. I have missed the deletion of the image until docker-p... 
[10:15:15] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/662916 (owner: 10Elukey) [10:15:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1157 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14248 and previous config saved to /var/cache/conftool/dbconfig/20210209-101556-marostegui.json [10:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:02] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:16:51] (03PS1) 10Marostegui: db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/662917 (https://phabricator.wikimedia.org/T258361) [10:16:53] (03CR) 10Elukey: [C: 03+2] sre.hadoop.stop-cluster: avoid context managers to apply downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/662916 (owner: 10Elukey) [10:17:44] (03CR) 10Marostegui: [C: 03+2] db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/662917 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [10:18:05] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff) [10:19:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 75%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14249 and previous config saved to /var/cache/conftool/dbconfig/20210209-101911-root.json [10:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:22] (03Merged) 10jenkins-bot: tox: Fix runs when system setuptools is old [software/spicerack] - 10https://gerrit.wikimedia.org/r/662905 (owner: 10David Caro) [10:19:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1019.eqiad.wmnet [10:19:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 75%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14250 and previous config saved to /var/cache/conftool/dbconfig/20210209-101939-root.json [10:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:13] 10SRE, 10Data-Persistence-Backup: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) [10:21:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1157 for the first time in s3 T258361', diff saved to https://phabricator.wikimedia.org/P14251 and previous config saved to /var/cache/conftool/dbconfig/20210209-102109-marostegui.json [10:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:14] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:23:00] 10SRE, 10Beta-Cluster-Infrastructure, 10SRE-swift-storage: Beta cluster Swift backend instances are missing profile::swift::storage::rsync_limit_memory_percent (puppet fails) - https://phabricator.wikimedia.org/T274092 (10hashar) Puppet is all happy on both instances indeed. Thank you! [10:23:35] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:27] 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10akosiaris) 05Open→03Resolved a:03akosiaris And with that, I think indeed we can close this task. Production has dropped jessie support for some time now and doesn't... 
[10:24:29] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10akosiaris) [10:29:32] (03CR) 10Marostegui: "This was actually: Enable notifications :)" [puppet] - 10https://gerrit.wikimedia.org/r/662917 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [10:31:00] (03PS1) 10Muehlenhoff: Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - 10https://gerrit.wikimedia.org/r/662918 [10:31:19] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3312 (re)pooling @ 100%: Slowly repooling db1090:3312 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14252 and previous config saved to /var/cache/conftool/dbconfig/20210209-103414-root.json [10:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1090:3317 (re)pooling @ 100%: Slowly repooling db1090:3317 after cloning db1170', diff saved to https://phabricator.wikimedia.org/P14253 and previous config saved to /var/cache/conftool/dbconfig/20210209-103443-root.json [10:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:01] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:24] (03PS2) 10David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) [10:36:44] (03CR) 10JMeybohm: [C: 04-1] Add support for php deployments (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [10:39:08] (03CR) 10David Caro: 
ceph.osd: Allow setting the io scheduler of the osd disks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: 10David Caro) [10:40:28] (03PS3) 10David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) [10:40:30] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2001.codfw.wmnet [10:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:58] (03PS4) 10David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) [10:41:37] !log rolling restart of esams LVS instances to catch up on kernel upgrades [10:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3007.esams.wmnet [10:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:33] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:08] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10Vgutierrez) p:05Triage→03Medium [10:48:11] (03PS1) 10DCausse: Add extra-analysis-khmer [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) [10:48:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3007.esams.wmnet [10:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:39] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:00] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3006.esams.wmnet [10:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 2%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14254 and previous config saved to /var/cache/conftool/dbconfig/20210209-105109-root.json [10:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:44] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:53:08] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [10:53:17] (03PS2) 10Volans: git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) [10:53:21] (03PS1) 10Jbond: sysctl: Allow Boolean values [puppet] - 10https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175) [10:53:24] (03CR) 10Volans: [C: 03+2] git: exclude black refactor from git blame [software/spicerack] - 10https://gerrit.wikimedia.org/r/662904 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [10:53:49] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:53:55] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2001.codfw.wmnet [10:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:35] PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. 
Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [10:54:58] (03CR) 10Filippo Giunchedi: [C: 03+1] Swift: Stop setting net.ipv4.tcp_tw_recycle for buster and later [puppet] - 10https://gerrit.wikimedia.org/r/662918 (owner: 10Muehlenhoff) [10:55:00] (03CR) 10jerkins-bot: [V: 04-1] sysctl: Allow Boolean values [puppet] - 10https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175) (owner: 10Jbond) [10:55:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3006.esams.wmnet [10:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:12] (03PS2) 10Jbond: sysctl: Allow Boolean values [puppet] - 10https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175) [10:57:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still [10:57:37] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still [10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:48] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:02:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs3005.esams.wmnet [11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:37] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10Majavah) Phabricator doesn't also seem to be caching user profile pictures at all, which are also stored in the `phabricator_file` database. 
[11:02:41] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:03:08] XioNoX: ^^ [11:04:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: 10David Caro) [11:05:17] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:40] (03CR) 10David Caro: [C: 03+2] remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:05:57] volans: that's triggered by me [11:06:01] ack [11:06:06] and expected (lvs3005 being restarted) [11:06:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 3%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14255 and previous config saved to /var/cache/conftool/dbconfig/20210209-110613-root.json [11:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3005.esams.wmnet [11:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:13] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:15] :) [11:11:55] (03CR) 10David Caro: [C: 03+2] ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: 10David Caro) [11:17:39] !log rolling restart of eqiad LVS instances to catch up on kernel upgrades [11:17:41] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet [11:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 4%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14256 and previous config saved to /var/cache/conftool/dbconfig/20210209-112116-root.json [11:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:42] (03PS12) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) [11:23:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet [11:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove obsolete netfilter sysctl setting [puppet] - 10https://gerrit.wikimedia.org/r/662908 (owner: 10Muehlenhoff) [11:27:56] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet [11:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:28:28] ^^ that's me again, and it's expected [11:28:50] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.012 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [11:30:56] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet [11:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:37] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop analytics cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [11:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:37] (03CR) 10Urbanecm: [C: 04-1] "see https://phabricator.wikimedia.org/T274137#6812206" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) (owner: 10Base) [11:34:38] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:46] !log start the upgrade process for Hadoop Analytics [11:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:10] (03PS1) 10Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175) [11:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14257 and previous config saved to /var/cache/conftool/dbconfig/20210209-113620-root.json [11:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:38] 10SRE, 10Traffic, 10User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (10jbond) I have created a starting point [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/662932 | profile ]] using mostly based on `cacheproxy::performance ` below ill... 
[11:37:52] (03CR) 10jerkins-bot: [V: 04-1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175) (owner: 10Jbond) [11:39:04] (03CR) 10Jbond: [C: 03+2] sysctl: Allow Boolean values [puppet] - 10https://gerrit.wikimedia.org/r/662924 (https://phabricator.wikimedia.org/T273175) (owner: 10Jbond) [11:40:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet [11:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:35] (03PS3) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [11:41:48] (03PS2) 10Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175) [11:42:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:43:22] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:24] (03CR) 10jerkins-bot: [V: 04-1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T273175) (owner: 10Jbond) [11:44:24] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:46:07] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet [11:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:50] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:20] (03PS1) 10Jbond: sysctl: reject undef values [puppet] - 10https://gerrit.wikimedia.org/r/662933 (https://phabricator.wikimedia.org/T274230) [11:50:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1001.eqiad.wmnet [11:50:04] (03PS3) 10Jbond: profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) [11:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:52] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27926" [puppet] - 10https://gerrit.wikimedia.org/r/662933 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond) [11:50:55] (03CR) 10David Caro: [C: 04-1] toolforge.etcdctl: add new etcdctl module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:51:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet [11:51:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 8%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14258 and previous config saved to /var/cache/conftool/dbconfig/20210209-115124-root.json [11:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:27] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1006.eqiad.wmnet [11:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:40] (03CR) 10jerkins-bot: [V: 04-1] profile::performance: add a new profile for tweaking sysctl parameters [puppet] - 10https://gerrit.wikimedia.org/r/662932 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond) [11:51:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1007.eqiad.wmnet [11:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:13] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1008.eqiad.wmnet [11:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1010.eqiad.wmnet [11:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:46] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:53:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:54] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[11:55:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet [11:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:56:49] (03Abandoned) 10Jbond: P:phabricator::httpd add tcp_tw_reuse and increase empherical ports [puppet] - 10https://gerrit.wikimedia.org/r/662901 (owner: 10Jbond) [11:57:06] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10hnowlan) Thanks for the heads up @CDanis - I've repooled. It appears there were some issues with the weights of other maps hosts that should have prevented this having an impact, I've rectified that now too. [11:57:54] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2005.codfw.wmnet [11:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:21] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2006.codfw.wmnet [11:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:27] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2008.codfw.wmnet [11:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:31] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2009.codfw.wmnet [11:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:37] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2010.codfw.wmnet [11:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1200) [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:28] yup, nothing to do [12:00:35] <_joe_> the best kind of deploy [12:00:38] hehe [12:00:44] <_joe_> the one that doesn't happen [12:00:51] I might deploy sth anyway [12:00:56] Daimona: hi, around for some more security patches in AF? [12:01:23] <_joe_> dang [12:01:26] <_joe_> :) [12:01:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27927/console" [puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [12:02:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop analytics cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [12:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:12] sorry for disturbing the best kind of a deploy _joe_ :) [12:03:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27927/mw1331.eqiad.wmnet/fulldiff.html shows this applies cleanly on the appservers." 
[puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [12:05:42] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [12:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14259 and previous config saved to /var/cache/conftool/dbconfig/20210209-120627-root.json [12:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:46] (03CR) 10Muehlenhoff: "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:09:30] Urbanecm: hey, in 30 minutes probably [12:09:44] Daimona: perfect! PM me once around :) [12:11:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think there is a minor flaw in the implementation (basically, we won't check for new hiera() calls at node scope). Otherwise LGTM, quite" (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [12:15:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM. Do you want to schedule a deployment time?" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [12:19:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] tegola: Add docker image. 
(033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [12:21:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 13%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14260 and previous config saved to /var/cache/conftool/dbconfig/20210209-122131-root.json [12:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:01] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:19] 10SRE, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jbond) FYI i had to [[ https://github.com/psf/black/pull/1545 | apply a fix ]] to get the following black vim plugin setting to work `let g:black_s... 
[12:36:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 15%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14261 and previous config saved to /var/cache/conftool/dbconfig/20210209-123634-root.json [12:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:38] (03PS18) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [12:37:52] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [12:38:21] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:58] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [12:43:39] (03PS3) 10Jbond: (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - 10https://gerrit.wikimedia.org/r/658626 [12:47:22] (03PS19) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [12:51:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 20%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14262 and previous config saved to /var/cache/conftool/dbconfig/20210209-125138-root.json [12:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:15] (03CR) 10jerkins-bot: [V: 04-1] (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - 10https://gerrit.wikimedia.org/r/658626 (owner: 10Jbond) [13:00:41] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:57] (03PS4) 10Jbond: Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:02:10] (03CR) 10Jbond: Add check to error when calling to hiera() (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:02:12] (03CR) 10jerkins-bot: [V: 04-1] Add check to error when calling to hiera() [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [13:03:55] PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:00] (03PS2) 10Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) [13:04:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [13:04:33] (03PS3) 10Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) [13:06:24] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [13:06:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14263 and previous config saved to /var/cache/conftool/dbconfig/20210209-130641-root.json [13:06:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:20] (03PS1) 10Arturo Borrero Gonzalez: toolforge: front proxy: drop non-TLS support [puppet] - 10https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123) [13:08:55] !log restart phabricator daemons to free 3.5gb of ram (memory leak?) [13:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:04] (03CR) 10jerkins-bot: [V: 04-1] toolforge: front proxy: drop non-TLS support [puppet] - 10https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123) (owner: 10Arturo Borrero Gonzalez) [13:10:51] (03PS1) 10Arturo Borrero Gonzalez: dynamicproxy: nginx.conf: drop duplicated nginx directives [puppet] - 10https://gerrit.wikimedia.org/r/662944 [13:11:09] (03PS1) 10Muehlenhoff: Initial client profile for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/662945 [13:13:35] (03PS1) 10Kormat: tox: Add py3 env that uses default system python3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 [13:13:38] (03PS2) 10Arturo Borrero Gonzalez: toolforge: front proxy: drop non-TLS support [puppet] - 10https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123) [13:14:19] (03PS1) 10Lucas Werkmeister (WMDE): wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662967 (https://phabricator.wikimedia.org/T204031) [13:15:30] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) >>! In T265138#6813427, @Ladsgroup wrote: > @jbond: Hey, you wrote this as a checkbox >> [] migrate all cron types to systemd::timer::job... 
[13:16:13] (03CR) 10jerkins-bot: [V: 04-1] tox: Add py3 env that uses default system python3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 (owner: 10Kormat) [13:16:46] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Remove old scheduled job disabling [puppet] - 10https://gerrit.wikimedia.org/r/644861 (owner: 10Jcrespo) [13:17:26] (03PS2) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [13:20:07] (03CR) 10Matthias Mullie: Add external entity search URI for new MediaSearch extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: 10Anne Tomasevich) [13:21:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 30%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14264 and previous config saved to /var/cache/conftool/dbconfig/20210209-132145-root.json [13:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:01] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [13:22:38] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) FYI i have tried to clarify things in the task description [13:23:03] RECOVERY - Check systemd state on wdqs2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:44] (03CR) 10Jbond: [C: 03+2] sysctl: reject undef values [puppet] - 10https://gerrit.wikimedia.org/r/662933 (https://phabricator.wikimedia.org/T274230) (owner: 10Jbond) [13:24:07] (03PS3) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org 
[puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [13:25:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [13:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:33] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:16] (03CR) 10CDanis: [C: 03+1] swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [13:27:29] (03PS4) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [13:30:55] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [13:31:36] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [13:31:42] (03PS2) 10Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/662907 (https://phabricator.wikimedia.org/T221904) [13:33:11] PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:33:39] PROBLEM - Hadoop DataNode on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with 
command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:34:58] added some downtime [13:36:15] RECOVERY - Hadoop DataNode on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:36:29] (03PS1) 10ArielGlenn: WANObjectCache: throw on Closure [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662953 (https://phabricator.wikimedia.org/T273242) [13:36:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 40%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14265 and previous config saved to /var/cache/conftool/dbconfig/20210209-133648-root.json [13:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:43] (03CR) 10ArielGlenn: [C: 03+2] WANObjectCache: throw on Closure [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662953 (https://phabricator.wikimedia.org/T273242) (owner: 10ArielGlenn) [13:38:27] RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:38:41] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) @Zbyszko Thanks for looking into this its really appreciated > we have similiar issues with it in WDQS streaming updater is `it` refering to Kyro he... [13:40:53] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:07] * Urbanecm secdeploying [13:47:51] (03PS5) 10Jbond: CAS style changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 [13:49:03] (03CR) 10Jbond: "> Patch Set 4:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [13:49:31] (03PS1) 10Lucas Werkmeister (WMDE): wikidata: add Dagbani to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662970 (https://phabricator.wikimedia.org/T272242) [13:51:21] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:33] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14266 and previous config saved to /var/cache/conftool/dbconfig/20210209-135152-root.json [13:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:54] !log Deploy security patch (T274152) [13:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] hashar and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1400). 
[14:02:31] (03Merged) 10jenkins-bot: WANObjectCache: throw on Closure [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662953 (https://phabricator.wikimedia.org/T273242) (owner: 10ArielGlenn) [14:03:51] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [14:05:44] downtimed --^ [14:06:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 60%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14267 and previous config saved to /var/cache/conftool/dbconfig/20210209-140655-root.json [14:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:25] some 1.36.0-wmf.29 backport got merged [14:07:33] I am refreshing the deployment server [14:07:37] and will promote group 0 then group 1 [14:07:41] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Zbyszko) >>! In T273867#6814932, @jbond wrote: > @Zbyszko Thanks for looking into this its really appreciated > >> we have similiar issues with it in WDQS... [14:08:38] (03PS5) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [14:09:32] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) >>! In T273867#6815032, @Zbyszko wrote: > I'd try to reorganize register() methods on kryo - so that new ones are after the old ones. I'm not deep en... 
[14:09:34] apergos: I am syncing the patch got merged [14:09:42] ok great [14:09:46] ohh [14:09:54] forgot about Urbanecm still running patches damn [14:10:13] (03PS1) 10Muehlenhoff: Add dummy keytab for sretest1001 [labs/private] - 10https://gerrit.wikimedia.org/r/662972 [14:10:20] (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez) [14:10:26] hashar: oh, sorry, I'm done with sec patches now [14:10:28] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.29/includes/libs/objectcache/wancache/WANObjectCache.php: WANObjectCache: throw on Closure - T273242 (duration: 01m 08s) [14:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242 [14:10:38] (since my !log statement) [14:10:56] Urbanecm: should have hold and verified the backport window had completed sorry bout that [14:11:15] (03PS1) 10Volans: icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) [14:11:26] (03CR) 10David Caro: "Did a quick review, looks nice, got some questions though. You can safely ignore the nits :)" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:11:37] (03CR) 10Ottomata: [C: 03+2] "Thank you! Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [14:11:46] (03CR) 10jerkins-bot: [V: 04-1] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans) [14:13:07] my attention is unfortunately split between here and a meeting I am facilitating, but I will absolutely respond to pings here [14:13:51] (03CR) 10CDanis: [C: 03+1] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans) [14:14:11] (03CR) 10Ottomata: "Applied and tested on mw1331, works fine." [puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [14:14:36] !log depooling wdqs1005, catching up on lag [14:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:42] ryankemper: ^^ [14:15:24] (03PS2) 10Volans: icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) [14:15:26] (03PS1) 10Volans: Fix tox invocation [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662974 [14:15:55] apergos: dont worry I will chceck logstash and rollback as needed :] [14:16:12] apergos: at least I know you are watching more or less which by itself is comforting! [14:16:13] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:14] (03CR) 10Volans: [C: 03+2] Fix tox invocation [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662974 (owner: 10Volans) [14:16:17] thank you (but I'm happy to know about it in real time too) [14:17:16] (03Merged) 10jenkins-bot: Fix tox invocation [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662974 (owner: 10Volans) [14:17:18] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662975 [14:17:20] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662975 (owner: 10Hashar) [14:17:25] (03PS6) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [14:17:35] I'm also around if I can be helpful in any way [14:17:42] \o/ [14:17:56] Majavah: I am happy to see your patch got blessed :] [14:18:13] it is not like I understand anything about the issue beside something somehow triggering an attempt to serialize a closure [14:18:15] :-\ [14:18:38] we're still not sure if that was even the issue [14:19:01] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662975 (owner: 10Hashar) [14:19:05] (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez) [14:19:33] d.uesen was yesterday thinking about not merging it yet to see if the new logging worked [14:20:04] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Train upgrade [puppet] - 10https://gerrit.wikimedia.org/r/662978 (https://phabricator.wikimedia.org/T261135) [14:20:19] no we're not [14:20:27] so we're throwing various things at it hoping that e 
[14:20:38] it's either fixed or we get better logging enabling us to find the issue [14:20:39] apaches are syncing [14:20:42] ok [14:21:19] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Zbyszko) >>! In T273867#6815035, @jbond wrote: >>>! In T273867#6815032, @Zbyszko wrote: >> I'd try to reorganize register() methods on kryo - so that new on... [14:21:48] fpm restarting [14:21:49] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.29 [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14268 and previous config saved to /var/cache/conftool/dbconfig/20210209-142159-root.json [14:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:11] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans) [14:22:21] (03CR) 10Volans: [C: 03+2] icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans) [14:22:39] (03PS7) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [14:22:52] (03Merged) 10jenkins-bot: icinga: reduce contacts limit to 1 [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/662973 (https://phabricator.wikimedia.org/T273951) (owner: 10Volans) [14:23:31] so that's just group0 [14:23:31] (03PS1) 10Andrew Bogott: cloud-vps: move eqiad1 from openstack 'stein' to 'train' [puppet] - 
10https://gerrit.wikimedia.org/r/662980 (https://phabricator.wikimedia.org/T261135) [14:23:58] is there a plan to go also to group 1 later in the hour, as the email mentioned to group1 also? [14:24:15] RECOVERY - Check systemd state on wdqs2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:18] (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez) [14:24:35] apergos: yeah i will do it right now [14:24:45] cause the group0 logs seems fine [14:24:55] ok, I expcted them to be quiet though [14:25:23] at least the logs are quiet [14:25:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytab for sretest1001 [labs/private] - 10https://gerrit.wikimedia.org/r/662972 (owner: 10Muehlenhoff) [14:25:27] beside a bunch of known issues [14:25:42] doing group1 [14:26:04] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662981 [14:26:06] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662981 (owner: 10Hashar) [14:26:28] !log cd /srv/external-monitoring; git fetch/status/pull on wikitech-static - T273951 [14:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:32] T273951: Update Icinga meta-monitoring to account for "no pagers" in contacts - https://phabricator.wikimedia.org/T273951 [14:26:58] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662981 (owner: 10Hashar) [14:27:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:47] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 
wikis to 1.36.0-wmf.29 [14:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:29] PROBLEM - Hadoop DataNode on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:29:31] PROBLEM - Hadoop DataNode on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:29:54] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.29 (duration: 01m 06s) [14:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:01] e/F/i/FeaturedFeedChannel:45 Call to a member function getCode() on string [14:31:26] uhhhh [14:32:48] Majavah: the full trace https://phabricator.wikimedia.org/T264391#6815153 [14:32:57] that is definitely a coding error in my featuredfeeds patch [14:33:01] 10SRE, 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10herron) @Papaul sure, sounds good. This host is not yet in production so there will be no prep/depool needed before the re-rack. [14:33:47] (03PS8) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [14:33:55] then the feeds seem to work [14:33:56] please rollback [14:34:00] (03PS1) 10Gehel: wait_reboot_since() is now using a constant backoff. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 [14:34:05] multilingual feeds do not, see https://commons.wikimedia.org/w/api.php?action=featuredfeed&feed=potd&feedformat=rss&language=en for example [14:35:27] (03CR) 10jerkins-bot: [V: 04-1] toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez) [14:37:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 85%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14269 and previous config saved to /var/cache/conftool/dbconfig/20210209-143703-root.json [14:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:07] (03PS1) 10Hashar: Revert "group1 wikis to 1.36.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662984 (https://phabricator.wikimedia.org/T271343) [14:37:09] (03CR) 10Hashar: [C: 03+2] Revert "group1 wikis to 1.36.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662984 (https://phabricator.wikimedia.org/T271343) (owner: 10Hashar) [14:37:27] RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:37:28] Majavah: and yeah that is rolling back [14:37:37] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.36.0-wmf.29" [14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:13] thank you [14:38:24] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.36.0-wmf.29" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662984 (https://phabricator.wikimedia.org/T271343) (owner: 10Hashar) [14:39:17] (03CR) 10Volans: [C: 03+1] "LGTM, in fact given that the code inside the retry is cheap is actually nicer to 
have it more reactive and check constantly every N second" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel) [14:39:26] options: revert the one patch in the branch, roll back out, or; daniel might be able to work up a fix very fast and we add that, merg,e backport, merge, and roll that out [14:39:34] hashar (or others): opinions? [14:39:52] I am guessing patching up the getCode() failure might be trivial enough? [14:39:59] I think I also have another issue with FeaturedFeeds [14:40:07] RECOVERY - Hadoop DataNode on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:40:07] cause without the patch the issue will still show up anyway [14:40:52] I already have a fix for this specific issue but also found another in the process [14:40:59] oh? [14:41:18] yeah, it breaks if you attempt to use Special:FeedItem with a date that does not have a feed item for that date [14:41:49] is tis a bug that existed before the cache issues showed up? [14:41:58] no, new thing :/ [14:42:02] ugh, ok [14:42:13] how easy is that going to be to fix? [14:42:41] fairly simple, but I'm more worried that other edge cases might have gotten around code review too [14:43:18] !log rebooting wdqs1009 / 1010 for kernel upgrade [14:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:36] right :-( [14:44:21] (03CR) 10jerkins-bot: [V: 04-1] wait_reboot_since() is now using a constant backoff. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel) [14:45:38] (03CR) 10Volans: [C: 03+1] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel) [14:45:46] (03PS20) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [14:46:05] so, do we want to revert the one patch or do you want to try to fix it and get a careful code review etc? [14:46:21] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662985/ fixes those specific issues [14:46:37] some people wanted to not have that patch in this brain at all to see if the logging worked [14:46:45] right [14:46:56] let me discuss it right now in cpt [14:46:58] (03CR) 10Jbond: "thanks updated" (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:48:42] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host pybal-test2001.codfw.wmnet [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:44] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2001.codfw.wmnet [14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:06] (03PS21) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [14:51:08] (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:52:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Slowly pool db1157 into s3', diff saved to https://phabricator.wikimedia.org/P14270 and previous config saved to /var/cache/conftool/dbconfig/20210209-145206-root.json [14:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:07] ok, after having 
talked it over with a few people, Daniel and I are taking the responsibility for te decision, we would like the feeds patch to come out of the branch for now, [14:54:15] roll forward the train with just the logging fixes [14:54:35] and we will try to get the feeds fix patch plus the new patch out tomorrow [14:54:41] hashar, does this work for you? [14:54:59] who can revert the feeds fix patch and roll the branch back out to groups 0/1? [14:55:09] so revert the featuredfeeds caching patch for .29? [14:55:13] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [14:55:18] (03CR) 10Jason Linehan: [C: 03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [14:55:19] yes, assuming that hashar agrees [14:55:21] I dont understand anything about the issue at end [14:55:27] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:30] and my mediawiki knowledge is close to zero nowadays [14:55:31] so [14:55:35] (03PS1) 10Majavah: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662956 [14:55:37] I cant really make any decision ;] [14:55:58] the idea is to drop the recent .29 backport we did like an hour ago [14:56:04] which would let the error occur again [14:56:09] but this time with better logging? [14:56:11] yes [14:56:18] so we are still blocked [14:56:19] but [14:56:22] have some better logging ;) [14:56:25] yes [14:56:25] is that correct? 
[14:56:44] and in the meantime: look at the additional feeds fix, [14:56:57] get it ready to go it, [14:57:00] *go in, [14:57:13] and use logging output to cross-check the issue [14:57:21] so we rollback FeaturedFeeds patch: * 8fc0f13 - (HEAD, origin/wmf/1.36.0-wmf.29) Caching fixes (14 hours ago) [14:57:32] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: upgrading openstack [14:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrading openstack [14:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:39] yes, Majavah has already put the revert patch in [14:57:54] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662956/ [14:58:07] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:33] (03CR) 10Hashar: [C: 03+2] "Causes:" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662956 (owner: 10Majavah) [14:58:46] +2 ed [14:59:13] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Train upgrade [puppet] - 10https://gerrit.wikimedia.org/r/662978 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [15:00:32] (03CR) 10Gehel: [C: 03+2] wait_reboot_since() is now using a constant backoff. [software/spicerack] - 10https://gerrit.wikimedia.org/r/662983 (owner: 10Gehel) [15:00:55] hashar thanks for the +2, I have a meeting, in 30 mins I will be here again once the revert has merged through [15:01:09] daniel will be here hopefully in 30 also, but maybe a little later [15:01:12] will deploy prmote again [15:01:18] ping me if anything needed [15:01:21] (03PS1) 10Gehel: wdqs: explicit shutdown of Blazegraph during reboots. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/662988 [15:01:22] and capture whatever log is appearing ;) [15:01:25] ok! [15:02:14] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: move eqiad1 from openstack 'stein' to 'train' [puppet] - 10https://gerrit.wikimedia.org/r/662980 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [15:03:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add support for php deployments (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [15:03:28] (03Merged) 10jenkins-bot: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662956 (owner: 10Majavah) [15:03:35] (03PS9) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [15:05:20] Majavah: apergos: deploying the revert and I will promote group1 again [15:05:29] so we should get enhanced error logs [15:05:47] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) Thanks i sent you response upstream will see what they come back with https://groups.google.com/u/1/a/apereo.org/g/cas-user/c/MkpgAZZn-Mw [15:06:32] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/FeaturedFeeds: Revert "Caching fixes" T264391 (duration: 01m 25s) [15:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:36] T264391: FeaturedFeedChannel must not contain a User object, since it cannot be serialized safely. 
- https://phabricator.wikimedia.org/T264391 [15:07:11] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/662988 (owner: 10Gehel) [15:07:13] (03PS1) 10Hashar: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662990 [15:07:15] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662990 (owner: 10Hashar) [15:08:11] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662990 (owner: 10Hashar) [15:10:05] verified that the feed commons potd feed is working now [15:10:11] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.29 [15:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:53] !log readding ganeti5002 to the eqsin Ganeti cluster following mainboard replacement/reinstall T261130 [15:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:57] T261130: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 [15:11:07] it is deploying [15:11:23] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.29 (duration: 01m 11s) [15:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:32] [YCKk3ApAICkAABTZCOAAAACA] /wiki/MediaWiki ArgumentCountError: Too few arguments to function FeaturedFeedChannel::__construct(), 3 passed in /srv/mediawiki/php-1.36.0-wmf.29/extensions/FeaturedFeeds/includes/FeaturedFeeds.php on line 214 and exactly 4 expected [15:13:01] that's with the patch reverted? [15:13:54] should hav ebeen reverted yeah [15:14:21] uh wut [15:14:48] maybe that one is a one off error though [15:14:55] FeaturedFeedChannel constructor has 3 parameters on master and .29, no idea how that's happening [15:15:08] yeah that happened only once [15:15:08] you have the url for that? 
[15:15:24] (03CR) 10Mholloway: [C: 04-2] "Hold until 1.36.0-wmf.30 is live on all wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [15:15:54] !log power down mw2220 for maintenance [15:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:50] constructor looks like this on mw1320: [15:16:53] public function __construct( $name, $options, $lang, User $user ) { [15:16:56] pasting to the task [15:17:00] (/srv/mediawiki/php-1.36.0-wmf.29/extensions/FeaturedFeeds/includes/FeaturedFeedChannel.php) [15:18:29] https://phabricator.wikimedia.org/T264391#6815291 [15:18:41] and I somehow failed git pulling [15:18:48] that four argument constructor is correct and wanted [15:19:11] it might just have been a off by one transient error [15:19:20] (03PS6) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 [15:19:21] the only caller should be using four [15:19:40] yeah, FeaturedFeeds.php calls it with four arguments as far as I can see (still on mw1320) [15:19:47] might have happened halfway through a sync? [15:19:55] yeah I think [15:20:03] cause our deploys are definitely not atomic [15:20:10] lets hope that [15:20:17] ok [15:20:17] and it was a single error [15:21:40] so with .29 promoted to group 1 there are no new logs showing up [15:21:45] that is a good sign I guess [15:21:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 95 hosts with reason: upgrading openstack [15:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:06] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: upgrading openstack [15:22:08] anything about the original issue we're trying to solve? 
[15:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:10] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrading openstack [15:22:11] though iirc the issue appearing on enwiki (which is not in group 1) [15:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:22] Majavah: nop. It is all quiet [15:22:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 95 hosts with reason: upgrading openstack [15:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:30] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 95 hosts with reason: upgrading openstack [15:22:32] scary [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 95 hosts with reason: upgrading openstack [15:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:15] maybe the issue is related to some cache expiry [15:23:23] and would only surface after some TTL expired [15:23:34] it is not like I know anything about what is going on though [15:23:53] I need to see which wikis had the error besides enwiki (later) [15:24:04] there were a total of 45 errors iirc, not a lot [15:24:10] (03CR) 10Kormat: "Fails because jenkins is currently using a debian stretch base image for this repo." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 (owner: 10Kormat) [15:24:10] waiting and seeing sounds good [15:24:38] FeaturedFeed might cache things for up to 24h [15:24:43] enwiki was one for sure [15:24:53] I cant find whether commonswiki was affected as well [15:25:13] I am going to catch my kids at school [15:25:14] (03CR) 10David Caro: [C: 03+2] "Getting it in, thanks for the understanding 😊" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:25:21] will be back in ~ half an hour [15:25:26] but things seems stable [15:25:41] then I guess we can look at updating the remaining wikis [15:27:55] be back in a few [15:29:17] ok, I will be here more or less until then (an through then) [15:29:39] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:27] !log power down logstash2035 for relocation [15:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:48] (03CR) 10Ladsgroup: [C: 03+1] "Had a meeting with Adam and can confirm it's really coming from him." 
[puppet] - 10https://gerrit.wikimedia.org/r/662661 (owner: 10Awight) [15:33:52] (03CR) 10jerkins-bot: [V: 04-1] Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:34:23] (03PS1) 10Jbond: sso cloud: add wmfcloud.org service registration [puppet] - 10https://gerrit.wikimedia.org/r/662995 [15:35:09] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:13] (03CR) 10Jbond: [C: 03+2] sso cloud: add wmfcloud.org service registration [puppet] - 10https://gerrit.wikimedia.org/r/662995 (owner: 10Jbond) [15:35:36] (03CR) 10David Caro: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:35:38] (03PS1) 10Herron: icinga: send fr-tech-ops alerts to victorops-fundraising [puppet] - 10https://gerrit.wikimedia.org/r/662996 (https://phabricator.wikimedia.org/T273065) [15:37:19] (03CR) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:37:22] (03CR) 10David Caro: [C: 03+2] Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:38:33] PROBLEM - ganeti-noded running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:38:59] PROBLEM - ganeti-mond running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [15:39:21] PROBLEM - ganeti-confd running on ganeti5002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd 
https://wikitech.wikimedia.org/wiki/Ganeti [15:39:32] (03PS1) 10Andrew Bogott: Train/buster: don't install python-pycadf [puppet] - 10https://gerrit.wikimedia.org/r/662997 (https://phabricator.wikimedia.org/T261135) [15:39:58] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Train upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/662957 [15:40:21] (03CR) 10Andrew Bogott: [C: 03+2] Train/buster: don't install python-pycadf [puppet] - 10https://gerrit.wikimedia.org/r/662997 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [15:40:31] PROBLEM - Host logstash2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:12] (03CR) 10Jgreen: [C: 03+2] icinga: send fr-tech-ops alerts to victorops-fundraising [puppet] - 10https://gerrit.wikimedia.org/r/662996 (https://phabricator.wikimedia.org/T273065) (owner: 10Herron) [15:42:00] logstash2035 is a planned re-rack jftr, will downtime now [15:44:02] back in 2 minutes (cat feeding) [15:44:42] (03Merged) 10jenkins-bot: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:46:37] RECOVERY - Host logstash2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.53 ms [15:47:16] 10SRE, 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10Papaul) 05Open→03Resolved @herron this is complete the server is back up and Netbox update. [15:48:30] 10SRE, 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10herron) LGTM thanks @Papaul! [15:48:36] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup), 10User-fgiunchedi: ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving, the other two items should be either done or tackled elsewhere...
[15:53:03] Majavah: would you be willing to write tests for the new patch? [15:53:58] RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [15:54:33] (03CR) 10Bstorm: "If we are at the point of directly patching the code with something we'd never even consider pushing upstream (like in Neutron), I think w" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [15:54:39] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10Papaul) [15:58:24] more or less back. Multitasking with my kids homework [15:58:34] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) 05Open→03Resolved @fgiunchedi I am resolving this but please open a decom task when you're ready to decommission this server. Thanks [15:59:04] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) thanks, @Bstorm this is on my list to do this week and will update you once...
[15:59:26] !log volker-e@deploy1001 Started deploy [design/style-guide@b9b7ee6]: Deploy design/style-guide: b9b7ee6 “Components”: Fix components overview SVG rendering glitch (#439) [15:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:33] !log volker-e@deploy1001 Finished deploy [design/style-guide@b9b7ee6]: Deploy design/style-guide: b9b7ee6 “Components”: Fix components overview SVG rendering glitch (#439) (duration: 00m 07s) [15:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:26] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH) [16:01:33] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10AntiCompositeNumber) >>! In T273741#6814266, @Shizhao wrote: > Rename to a new filename? The Commons community generally avoids moving files, a... [16:01:56] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Mvolz) [16:02:13] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10RobH) [16:02:36] (03CR) 10Bstorm: "Do we have a diff from the packaged file somewhere or should I make one?" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [16:03:29] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10fgiunchedi) 05Resolved→03Open >>! In T272209#6815416, @Cmjohnson wrote: > @fgiunchedi I am resolving this but please open a decom task when you're ready to decommission this server. Thanks [reopening] This host...
[16:03:53] (03CR) 10Volans: [C: 03+1] "Looks sane to me but I'm not familiar with our k8s puppetization. Maybe Luca or John are more familiar?" [puppet] - 10https://gerrit.wikimedia.org/r/662945 (owner: 10Muehlenhoff) [16:04:08] (03CR) 10Andrew Bogott: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [16:04:54] hashar: see anything in the logs? [16:06:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Papaul) [16:06:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Papaul) 05Open→03Resolved @Dzahn Drained power and upgrade IDRAC firmware from 2.30.30.30 to 2.63. All looks good now [16:07:22] (03PS2) 10CRusnov: openldap/offboard-user.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) [16:07:38] (03CR) 10CRusnov: openldap/offboard-user.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:09:08] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:09:47] 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) @fgiunchedi is it necessary to spend time fixing this issue since your plan is to decom this server?
[16:10:00] (03CR) 10Andrew Bogott: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [16:11:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:13] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [16:13:01] apergos: checking logs [16:13:17] kids homework done (after begging for random games instead of focusing on reading bah) [16:13:25] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Cmjohnson) netbox updated with switch info and dns. [16:13:27] (03CR) 10Bstorm: [C: 03+1] "I'll make a task to track winding down our use cases." [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [16:14:00] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836 [16:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:04] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [16:14:25] apergos: nothing related to the closure issue or pagefeeds [16:14:37] what are you using to search? [16:18:04] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [16:20:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:18] !log installing wireshark security updates [16:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:09] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:15] (03PS22) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [16:22:29] oh well, I have logstash open, filtering out some db stuff and allowing a bunch more, with wanobjectcache in the message, not realyl anything going on yet [16:22:31] (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [16:22:48] (03CR) 10Bstorm: [C: 03+1] "https://phabricator.wikimedia.org/T274268 so we can add structure to that process over time and figure out when we are really ready to sto" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [16:24:07] (03PS7) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:24:55] apergos: yeah that sounds fine [16:25:12] (03PS1) 10Ottomata: Bump Hadoop datanode heap to 8G [puppet] - 10https://gerrit.wikimedia.org/r/663003 (https://phabricator.wikimedia.org/T273711) [16:25:35] hashar: was there a plan to roll out to group2 or to $some_wiki later? 
[16:25:44] where $some_wiki is one we know to have had issues [16:25:45] I haven't made any plan [16:25:50] ok [16:25:51] I am fine pushing it nowish I guess [16:26:01] (03CR) 10Elukey: [C: 03+1] Bump Hadoop datanode heap to 8G [puppet] - 10https://gerrit.wikimedia.org/r/663003 (https://phabricator.wikimedia.org/T273711) (owner: 10Ottomata) [16:26:02] (03CR) 10Ottomata: [C: 03+2] Bump Hadoop datanode heap to 8G [puppet] - 10https://gerrit.wikimedia.org/r/663003 (https://phabricator.wikimedia.org/T273711) (owner: 10Ottomata) [16:26:06] well lemme see if there is a smallish good candidate [16:26:33] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:08] I am looking for the past Closure issue messages [16:27:30] 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10fgiunchedi) >>! In T273895#6815460, @Papaul wrote: > @fgiunchedi is it necessary to spend time fixing this issue since your plan is to decom this server? I don't know if it is a real problem,... 
[16:27:31] (03PS1) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741) [16:27:56] there were some on commonswiki [16:28:10] but bulk of them were on enwiki / zhwiki / jawiki which are not from group 1 [16:28:16] so yeah I guess we need to promote everything [16:28:26] (03PS2) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741) [16:28:31] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [16:29:12] 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) @fgiunchedi ok. [16:29:25] 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) a:03Papaul [16:29:35] (03PS1) 10Jcrespo: bacula: Temporarily enable read-only backups and disable rw backups of ES [puppet] - 10https://gerrit.wikimedia.org/r/663005 (https://phabricator.wikimedia.org/T79922) [16:29:40] guess we can promote now? [16:31:28] (03CR) 10Tjones: "Do we have to worry about the debian-glue-non-voting failure?"
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse) [16:32:40] (03CR) 10Jcrespo: [C: 03+2] bacula: Temporarily enable read-only backups and disable rw backups of ES [puppet] - 10https://gerrit.wikimedia.org/r/663005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [16:32:59] PROBLEM - SSH on logstash1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:33:18] if there were "some" on group1 I would uh [16:33:35] I'm having trouble getting lgostash to show me some older examples >_< [16:33:41] ;D [16:34:04] you can use fixed date for Jan 29th [16:34:29] it keeps giving me little orange error boxes [16:34:35] RECOVERY - SSH on logstash1022 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:34:39] and then I can't even expand stack traces or add filters, it's maddening [16:35:00] (03CR) 10DCausse: "> Patch Set 1:" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse) [16:36:55] apergos: https://logstash.wikimedia.org/goto/325a206bdc1b1fc9178c4893762fa936 [16:37:06] occurences of message:Closure for Jan 29 [16:37:07] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.154 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:37:12] trying that, thanks [16:37:13] at some arbitrary time range [16:37:22] on mediawiki-errors dashboard [16:38:01] uhh that logstash error doesn't sound good [16:39:54] https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&orgId=1&refresh=5m&from=now-24h&to=now [16:40:31] looks like it always has some business around that time of the day for some reason 
[16:41:15] (03PS1) 10Joal: Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009 [16:41:21] elukey, ottomata --^ [16:41:40] (03PS2) 10Ottomata: Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009 (https://phabricator.wikimedia.org/T273711) (owner: 10Joal) [16:42:45] I am going to do the promote [16:43:24] yuck [16:43:29] (03PS1) 10Hashar: all wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663010 [16:43:32] (03CR) 10Hashar: [C: 03+2] all wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663010 (owner: 10Hashar) [16:43:41] maybe wait til logstash is happy again? [16:43:50] hashar: ^^ [16:44:57] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663010 (owner: 10Hashar) [16:45:26] apergos: I dont think it is much of an issue [16:45:31] ok [16:45:38] i that case roll em roll em roll em [16:45:40] seems to be oscillating between 5 and 8 indexing failures per seconds [16:45:44] based on our overall traffic [16:45:54] (03PS3) 10Joal: Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009 [16:46:06] and since now is kind of peak hours (late in asia, prime in europe, us is around), I guess it is normal to have more events and more failures [16:46:50] (03CR) 10Elukey: [C: 03+2] Reduce Hadoop Yarn available memory from 4G [puppet] - 10https://gerrit.wikimedia.org/r/663009 (owner: 10Joal) [16:47:09] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.2 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:47:15] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.29 [16:47:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:46] the syslog for logstash does show there have been a peak of indexing error around 16:46 ( https://logstash.wikimedia.org/goto/8c3d1c10bc7574b41793e2b61dc2f151 ) [16:49:03] baseline ~ 800 events / 30 seconds. Peak at 1600 for a 30 seconds bucket [16:49:49] indeed, there's a bunch of open tasks to tackle the persisting indexing errors :( [16:50:02] seems it is due to some fields having mismatching types [16:50:20] (03PS5) 10Ejegg: Disable CentralNotice on API portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) [16:50:34] that's correct yes [16:50:52] so we're on 29 everywhere [16:50:56] lemme get on the log host [16:51:03] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:19] and thank you so much for your patches [16:56:23] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:33] yup, +\infty for those who do latex math :) [16:56:58] no its 1900 local now, midnight UTC will be 2am local [16:57:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/663011 (owner: 10David Caro) [16:57:59] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:04] (03CR) 10David Caro: "Nice work! 
LGTM codewise but I'll leave for someone that knows the context to approve :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [16:58:25] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put into maintenance mode for Train upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/662957 (owner: 10Andrew Bogott) [16:58:47] PROBLEM - tileratorui on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:58:51] (03PS1) 10Hashar: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) [16:58:56] Majavah: I’ll skip the good night but say good luck with the exam! :) [16:59:02] ^^ reconciliating wmf.30 [16:59:04] which lacks the rever [16:59:05] t [16:59:06] (03CR) 10David Caro: [C: 03+2] Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" part 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/663011 (owner: 10David Caro) [16:59:14] thank you Lucas_WMDE :D [16:59:16] wdym hashar? [16:59:21] and yeah, good luck with your exam :) [16:59:23] PROBLEM - tilerator on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:59:30] wdym ?
[16:59:39] oh [16:59:41] what do you mean by " reconciliating wmf.30" [16:59:44] (03PS2) 10Bstorm: wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) [16:59:52] that is non sense sorry [17:00:00] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:00:00] (i've also had exams every day since last wens) [17:00:02] :( [17:00:04] jbond42 and cdanis: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1700) [17:00:24] so in short we had a patch for FeaturedFeeds on wmf.29 but it was faulty and got reverted [17:00:39] however the faulty patch got included in the wmf.30 branch and thus has to be reverted [17:00:43] which is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662964 :) [17:00:47] ah, makes sense [17:01:29] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [17:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:03] (03CR) 10Hashar: "The change made it to master and thus got included in the wmf.30 branch cut. We have cherry picked to wmf.29 and found out it was causing" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) (owner: 10Hashar) [17:02:18] meeting time [17:02:32] will watch logstash in parallel. 
Ping me if anything is needed [17:02:42] omg there are some log entries [17:02:47] now if only one has the key [17:03:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:04:08] (03CR) 10David Caro: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [17:04:13] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) Thank you @Papaul. Reimaging it now and things look normal. [17:04:32] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10epriestley) > We need an easy way to tell logged-out traffic apart in our caching layer. Can your caching layer examine cookie values? If the `phsid` cookie... [17:05:48] (03CR) 10Jbond: "This change is ready for review." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:06:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:07:22] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [17:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:52] Async refresh failed for frwiki:featured-feeds:1:fr for people following along! [17:07:56] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:13:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [17:13:11] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet [17:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:02] what about doing a similar request but with a different language code to bypass the already-cached result and with mwdebug to hopefully get more detailed debug logging? [17:14:31] (03CR) 10Bstorm: "Adding Andrew as a reviewer just to poke for errors if you see any. I think this should add a log file per process and a logrotate." [puppet] - 10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [17:14:32] that...might reproduce it [17:14:41] or simply purge the actual key :) [17:14:50] do it on another wiki? [17:14:50] um [17:15:27] apergos: not all wikis have featuredfeeds enabled [17:15:28] ah there's only the one so far. 
[17:15:33] let's see if we get another [17:15:38] I'd like to keep one intact [17:15:44] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663016 [17:15:48] the page took ages to load on mwdebug1001 [17:15:58] rendered successfully, no errors visible to the user [17:16:10] Async refresh failed for jawiki:featured-feeds:1:ja [17:16:20] can anyone look if that reproduced it? cache key should be in theory frwiki:featured-feeds:1:fi with an I instead of R [17:16:35] um let's see [17:17:10] (03PS4) 10Giuseppe Lavagetto: upload-frontend: ban a specific url with no referer nor UA [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741) [17:17:13] don't see i in logstash yet [17:17:19] most recent is 3 minutes ago [17:17:25] the ja one. [17:17:44] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10ssingh) Update: Thank you for the interest in this task! Like we shared yesterday, we have identified that the traffic is... [17:18:26] still no. [17:20:02] let me try something... 
[17:20:09] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.388 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:20:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet [17:20:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet [17:20:51] oohhh there's som e fa and he wiki ones now [17:21:23] <_joe_> uhm indexing errors on logstash is not a good sign usually [17:23:00] PROBLEM - nova-compute proc minimum on cloudvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:02] PROBLEM - nova-compute proc minimum on cloudvirt1035 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:04] PROBLEM - nova-compute proc minimum on cloudvirt1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:10] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:30] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:32] PROBLEM - nova-compute proc minimum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:32] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:42] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:42] PROBLEM - nova-compute proc minimum on cloudvirt1029 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:50] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:51] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:51] PROBLEM - nova-compute proc minimum on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:23:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2264.codfw.wmnet with reason: REIMAGE [17:24:08] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 95 hosts with reason: upgrading openstack [17:24:10] PROBLEM - nova-compute proc maximum on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:24:13] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10fdans) @ssingh said it yesterday on chat but this is such stellar data detective work. Congrats on finding the culprit!! [17:24:28] PROBLEM - nova-compute proc maximum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:24:30] PROBLEM - nova-compute proc maximum on cloudvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:24:31] PROBLEM - nova-compute proc maximum on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:24:32] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:24:33] that's some downtime expiring, I'll renew [17:24:43] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 95 hosts with reason: upgrading openstack [17:25:28] RECOVERY - nova-compute proc maximum on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:25:30] RECOVERY - nova-compute proc maximum on cloudvirt1020 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:25:30] RECOVERY - nova-compute proc maximum on cloudvirt1036 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:25:31] why do i get `TypeError: Too few arguments to function FeaturedFeeds::getFeeds()`? [17:25:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet [17:25:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1383.eqiad.wmnet with reason: REIMAGE [17:25:44] it says `exactly 2 expected` [17:25:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2264.codfw.wmnet with reason: REIMAGE [17:25:57] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Majavah) Is the effect that the block will have in the app known? [17:26:19] but that function...needs only one argument? 
[17:26:21] RECOVERY - nova-compute proc minimum on cloudvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:23] RECOVERY - nova-compute proc minimum on cloudvirt1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:25] RECOVERY - nova-compute proc minimum on cloudvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:25] RECOVERY - nova-compute proc maximum on cloudvirt1025 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:31] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:39] 10SRE, 10Platform Engineering, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10thcipriani) [17:26:42] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:45] RECOVERY - nova-compute proc minimum on cloudvirt1025 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:55] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:26:55] RECOVERY - nova-compute proc minimum on cloudvirt1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:28:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1383.eqiad.wmnet with reason: REIMAGE [17:28:24] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:28:52] (03PS1) 10Giuseppe Lavagetto: varnish: fix escaping of variables in test run script [puppet] - 10https://gerrit.wikimedia.org/r/663017 [17:29:15] 10SRE, 10Scap, 10Release-Engineering-Team-TODO, 10Sustainability (Incident Followup), 10User-brennen: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10thcipriani) [17:33:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2220.codfw.wmnet with reason: REIMAGE [17:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1300.eqiad.wmnet with reason: REIMAGE [17:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2220.codfw.wmnet with reason: REIMAGE [17:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1300.eqiad.wmnet with reason: REIMAGE [17:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:54] RECOVERY - Check systemd state on 
wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:59] (03PS29) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:40:55] (03CR) 10Hnowlan: start using imposm as OSM sync tool (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:43:29] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still [17:43:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still [17:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:14] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:47:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2264.codfw.wmnet'] ` an... 
[17:48:06] RECOVERY - nova-compute proc minimum on cloudvirt1037 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:50:11] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10calbon) Just to second what @fdans said, the data detective work was great and this was such a fun ticket to watch. [17:51:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1383.eqiad.wmnet'] ` an... [17:51:40] 10SRE, 10Data-Persistence-Backup: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) Something that may or may not be related, but we will want to correct is that backup2002 is resolved on dns... [17:55:57] (03PS30) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:58:36] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: a new (but still terrible) approach to making projectid==projectname [puppet] - 10https://gerrit.wikimedia.org/r/662800 (https://phabricator.wikimedia.org/T274165) (owner: 10Andrew Bogott) [18:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1800). 
[18:01:15] (03PS31) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [18:02:20] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) >>! In T273741#6815874, @Majavah wrote: > Is the effect that the block will have in the app known? No, hence we trie... [18:02:47] grbmbl [18:02:49] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [18:14:37] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:20] (03PS1) 10Andrew Bogott: Remove old OpenStack Rocky files/templates/manifests [puppet] - 10https://gerrit.wikimedia.org/r/663027 [18:20:05] (03PS2) 10DLynch: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) (owner: 10Bartosz Dziewoński) [18:20:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [18:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1300.eqiad.wmnet'] ` an... 
[18:21:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:10] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:28] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:57] this is probably related to jupyter, we are upgrading [18:23:39] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2220.codfw.wmnet'] ` an... [18:24:01] Urbanecm: did you figure out the parameter issue? [18:24:28] Majavah: yes, I was stupid, I opened my local master version, and the server used wmf.29 obviously, and the number of params was different [18:25:38] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:58] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:24] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:02] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:35] apergos: Majavah: we are letting the state as is for now [18:27:51] ok (I'm also pretty done for the day) [18:27:52] guess we will want to backport the cache fix again [18:27:59] or just https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662985 [18:28:03] (which is for master) [18:28:16] there is the additional fix that needs to go in on top of that one, after testing and code review [18:28:55] yeah I think it's both those combined, anyways I'll be back to looking at things tomorrow [18:29:33] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:21] so I guess yes two patches got to be carried [18:30:41] will check with this week train conductors in ~ 1H30 and that will follow from there [18:30:45] for now it is dinner time [18:32:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:10] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:12] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [18:34:31] ah snap downtime [18:34:57] we are upgrading :) [18:37:00] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2220.codfw.wmnet [18:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:14] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1300.eqiad.wmnet [18:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:24] RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [18:37:24] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:37:40] !log T267927 [WDQS Data Reload] Clearing old wikidata journal file to free disk space before beginning data reload:`sudo systemctl status wdqs-blazegraph && sudo systemctl stop wdqs-blazegraph && sudo rm -fv /srv/wdqs/wikidata.jnl && sudo systemctl start wdqs-blazegraph` on `wdqs100[9,10]` [18:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:46] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [18:39:05] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:39:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [18:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:13] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:16] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [18:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:25] !log T267927 [WDQS Data Reload] `sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` on `ryankemper@cumin1001` tmux session `wdqs_data_reload_1009` [18:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:57] !log T267927 [WDQS Data Reload] `sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` on `ryankemper@cumin1001` tmux session `wdqs_data_reload_1009` [18:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:53] !log T267927 [WDQS Data Reload] Small typo in previous SAL log message, see subsequent SAL line for correction: [18:41:54] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10Gilles) a:0320after4 [18:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:04] !log T267927 [WDQS Data Reload] `sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` on `ryankemper@cumin1001` tmux session `wdqs_data_reload_1010` [18:42:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:01] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) **backup2002 -> backup1002** (please note this was while large backups were running in the backg... [18:45:36] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:49] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10mmodell) Indeed I think it's caching static files correctly. [18:47:45] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10epriestley) > Eventually Phabricator ran out of local sockets (30k limit): I'm not familiar with the particulars of your infrastructure, but if this was on th... [18:48:30] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10Gilles) a:0520after4→03mmodell [18:52:18] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10Majavah) Hmm. I'm looking at the Firefox developer console, and it looks like this: {F34098763} Note that styles and scripts are marked as "cached" instead of... 
[18:57:00] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [18:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T1900) [19:01:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2264.codfw.wmnet [19:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:33] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [19:01:34] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1383.eqiad.wmnet [19:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [19:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:14] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - elukey@cumin1001 [19:04:15] !log elukey@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. 
- elukey@cumin1001 [19:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:34] ufff [19:08:23] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [19:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [19:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:10] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10epriestley) I suspect you're seeing that because when you reload a page by issuing a "Reload" command in your browser, most (all?) modern browsers interpret th... 
[19:13:42] (03PS9) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [19:15:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2220.codfw.wmnet [19:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:25] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1300.eqiad.wmnet [19:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:51] (03CR) 10Ryan Kemper: relforge: service impl of relforge100[3,4] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [19:18:53] (03CR) 10Ryan Kemper: [C: 03+2] relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [19:19:53] !log T262211 Attempting to bring `relforge100[3,4]` into service; merging https://gerrit.wikimedia.org/r/661229 [19:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:58] T262211: Service implementation for relforge100[34] - https://phabricator.wikimedia.org/T262211 [19:21:14] !log T262211 `sudo cumin 'P{relforge*}' 'sudo run-puppet-agent'` on `ryankemper@cumin1001` [19:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:01] 10SRE, 10netops: tcp handshake failure between pfw3-eqiad and frlog1001:6514 - https://phabricator.wikimedia.org/T263833 (10ayounsi) [19:23:16] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [19:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:07] (03CR) 1020after4: [C: 03+2] Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 
10https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) (owner: 10Hashar) [19:26:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1383.eqiad.wmnet [19:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:04] (03Merged) 10jenkins-bot: Revert "Caching fixes" [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662964 (https://phabricator.wikimedia.org/T264391) (owner: 10Hashar) [19:30:53] (03PS1) 10Razzi: sre.druid.roll-restart-workers: properly pass commands list [cookbooks] - 10https://gerrit.wikimedia.org/r/663033 (https://phabricator.wikimedia.org/T269925) [19:31:31] (03PS1) 10Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/663034 (https://phabricator.wikimedia.org/T274208) [19:33:15] (03CR) 10Elukey: [C: 03+1] sre.druid.roll-restart-workers: properly pass commands list [cookbooks] - 10https://gerrit.wikimedia.org/r/663033 (https://phabricator.wikimedia.org/T269925) (owner: 10Razzi) [19:33:50] (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [19:33:53] (03CR) 10Razzi: [C: 03+2] sre.druid.roll-restart-workers: properly pass commands list [cookbooks] - 10https://gerrit.wikimedia.org/r/663033 (https://phabricator.wikimedia.org/T269925) (owner: 10Razzi) [19:33:59] (03CR) 10Andrew Bogott: [C: 03+2] profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/663034 (https://phabricator.wikimedia.org/T274208) 
(owner: 10Andrew Bogott) [19:35:24] !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [19:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:46] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2264.codfw.wmnet [19:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:40:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:40:48] (03PS1) 10Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208) [19:41:01] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:41:26] (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [19:41:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:42:22] (03CR) 10jerkins-bot: [V: 04-1] profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208) (owner: 10Andrew Bogott) [19:44:59] (03PS2) 10Andrew Bogott: profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208) [19:46:33] (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [19:47:09] (03CR) 10Andrew Bogott: [C: 03+2] profile::labs::downloadserver: remove lvm class [puppet] - 10https://gerrit.wikimedia.org/r/663036 (https://phabricator.wikimedia.org/T274208) (owner: 10Andrew Bogott) [19:51:17] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:09] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2263.codfw.wmnet with reason: REIMAGE [19:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1382.eqiad.wmnet with reason: REIMAGE [19:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2263.codfw.wmnet with reason: REIMAGE [19:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1385.eqiad.wmnet with reason: REIMAGE [19:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] twentyafterfour and hashar: May I have your attention please! Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T2000) [20:00:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1382.eqiad.wmnet with reason: REIMAGE [20:00:27] !log prepping 1.36.0-wmf.30 [20:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:09] reimages are ongoing but ideally you dont notice because they get removed from scap dsh groups in time and I scap pull before repooling [20:02:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1385.eqiad.wmnet with reason: REIMAGE [20:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:27] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:05:33] PROBLEM - Hive Metastore on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [20:05:44] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27938/" [puppet] - 10https://gerrit.wikimedia.org/r/662033 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:05:54] (03PS2) 10Dzahn: profile::rsyslog::udp_json_logback_compat: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/662033 (https://phabricator.wikimedia.org/T209953) [20:06:07] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [20:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:26] the hive errors are me sorry [20:07:21] thanks, ack [20:07:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1299.eqiad.wmnet with reason: REIMAGE [20:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [20:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:43] (03PS1) 10SBassett: Link to non-wiki privacy policy [puppet] - 10https://gerrit.wikimedia.org/r/663040 (https://phabricator.wikimedia.org/T207244) [20:09:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1299.eqiad.wmnet with reason: REIMAGE [20:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:34] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics 
cluster: Change Hadoop distribution - elukey@cumin1001 [20:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:57] (03CR) 10Jforrester: "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 (owner: 10Kormat) [20:11:19] !log otto@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - otto@cumin1001 [20:11:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [20:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:07] !log otto@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - otto@cumin1001 [20:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:54] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [20:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop analytics cluster: Change Hadoop distribution - elukey@cumin1001 [20:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:35] (03CR) 10Dzahn: "noop on wdqs1003, restbase1022, thumbor2004,.." 
[puppet] - 10https://gerrit.wikimedia.org/r/662033 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:14:57] (03PS2) 10Dzahn: wmcs::monitoring: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662026 (https://phabricator.wikimedia.org/T209953) [20:19:40] (03CR) 10Jfishback: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/663040 (https://phabricator.wikimedia.org/T207244) (owner: 10SBassett) [20:20:38] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2263.codfw.wmnet'] ` an... [20:21:07] (03CR) 1020after4: [C: 03+2] Branch commit for wmf/1.36.0-wmf.30 [core] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662803 (owner: 10TrainBranchBot) [20:21:26] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [20:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:50] RECOVERY - Hive Metastore on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [20:23:48] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1385.eqiad.wmnet'] ` an... 
[20:24:06] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1382.eqiad.wmnet'] ` an... [20:24:44] RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:12] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2263.codfw.wmnet [20:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:29] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1385.eqiad.wmnet [20:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1382.eqiad.wmnet [20:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1385.eqiad.wmnet [20:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1382.eqiad.wmnet [20:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2263.codfw.wmnet [20:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:46] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.30 [core] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662803 (owner: 10TrainBranchBot) [20:45:04] 10SRE, 10serviceops, 
10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:45:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:46:01] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:49:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:53] (03CR) 10Dzahn: "compiler shows noop on cloudmetrics1002 - https://puppet-compiler.wmflabs.org/compiler1002/27939/" [puppet] - 10https://gerrit.wikimedia.org/r/662026 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:54:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1299.eqiad.wmnet'] ` an... 
[20:55:55] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27940/grafana2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:56:28] (03PS2) 10Dzahn: grafana: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953) [20:56:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1299.eqiad.wmnet [20:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:31] (03CR) 10Dzahn: "noop on grafana1002" [puppet] - 10https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:58:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1299.eqiad.wmnet [20:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:01:25] (03PS2) 10Dzahn: netmon: replace hiera within hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953) [21:01:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1384.eqiad.wmnet with reason: REIMAGE [21:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1381.eqiad.wmnet with reason: REIMAGE [21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2260.codfw.wmnet with reason: REIMAGE [21:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:27] 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) p:05Triage→03Medium [21:04:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1384.eqiad.wmnet with reason: REIMAGE [21:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1381.eqiad.wmnet with reason: REIMAGE [21:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2260.codfw.wmnet with reason: REIMAGE [21:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:28] (03CR) 10Dzahn: [C: 03+2] Link to non-wiki privacy policy [puppet] - 10https://gerrit.wikimedia.org/r/663040 (https://phabricator.wikimedia.org/T207244) (owner: 10SBassett) [21:10:02] !log Analytics Hadoop cluster upgrade completed [21:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:56] (03CR) 10Dzahn: [C: 
03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27941/" [puppet] - 10https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:12:51] (03CR) 1020after4: "This change is ready for review." [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4) [21:13:04] (03CR) 10jerkins-bot: [V: 04-1] Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4) [21:14:10] (03CR) 10Dzahn: "noop on netmon1002, netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:14:39] (03Abandoned) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4) [21:15:04] (03PS2) 10Dzahn: netbox: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953) [21:17:20] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27943/" [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:18:52] (03CR) 10Dzahn: "noop on netbox1001,netbox2001" [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:26:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:00] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/27944/ but I am kind of expecting some alerts after this gets merged.. 
a new timer on EV" [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:27:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1298.eqiad.wmnet with reason: REIMAGE [21:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1384.eqiad.wmnet'] ` an... [21:29:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2260.codfw.wmnet'] ` an... [21:29:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1298.eqiad.wmnet with reason: REIMAGE [21:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:49] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1381.eqiad.wmnet'] ` an... 
[21:30:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1381.eqiad.wmnet [21:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:33] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1384.eqiad.wmnet [21:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2260.codfw.wmnet [21:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:29] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.154e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:35:40] (03CR) 10Dzahn: "not just mwdebug but also other canaries https://puppet-compiler.wmflabs.org/compiler1001/27945/mw2271.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [21:36:57] (03PS2) 10Legoktm: docker_registry_ha: Properly override nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/662806 [21:36:59] (03PS2) 10Legoktm: [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) [21:38:23] 10SRE, 10dev-images, 10docker-pkg, 10Release-Engineering-Team (Local Dev): docker-pkg: "certificate verify failed: unable to get local issuer certificate" for docker-registry.discovery.wmnet when publishing dev-images from contint2001 - https://phabricator.wikimedia.org/T274306 (10brennen) [21:39:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1384.eqiad.wmnet [21:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:31] 10SRE, 10dev-images, 10docker-pkg, 10Release-Engineering-Team 
(Local Dev), 10User-brennen: docker-pkg: "certificate verify failed: unable to get local issuer certificate" for docker-registry.discovery.wmnet when publishing dev-images from contint2001 - https://phabricator.wikimedia.org/T274306 (10brennen... [21:40:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:40:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1381.eqiad.wmnet [21:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:42:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2260.codfw.wmnet [21:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:31] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:45:44] (03PS3) 10Legoktm: [WIP] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) [21:48:31] (03PS2) 10Dzahn: gerrit: replace certbot cron for cloud with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) [21:56:56] (03CR) 10Dzahn: "cloud-only, not affecting prod: https://puppet-compiler.wmflabs.org/compiler1001/27949/" [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:57:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1373.eqiad.wmnet with reason: REIMAGE [21:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1380.eqiad.wmnet with reason: REIMAGE [21:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1373.eqiad.wmnet with reason: REIMAGE [21:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:04] Legoktm and DannyS712: It is that lovely time of the day again! You are hereby commanded to deploy GlobalWatchlist deployment to production. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210209T2200). [22:00:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2259.codfw.wmnet with reason: REIMAGE [22:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:56] legoktm ready when you are [22:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:10] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712) [22:01:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1380.eqiad.wmnet with reason: REIMAGE [22:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:13] (03CR) 10Dzahn: [C: 03+2] gerrit: replace certbot cron for cloud with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:02:49] give me a minute [22:03:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2259.codfw.wmnet with reason: REIMAGE [22:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:32] DannyS712: did all the patches make it in the train? [22:04:00] {{checking}} [22:04:33] (03CR) 10Volans: "Code looks sane and in line with the current style, that as discussed we both agree should be refactored, but that's out of scope for this" (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [22:04:52] everything that has merged already was merged before the branch cut, per https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GlobalWatchlist, but I'd like to get https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalWatchlist/+/661103 in before the deployment [22:05:51] except something happened with the train? testwiki is only on wmf.29, not .30 [22:07:21] (03PS2) 10Dzahn: gerrit: remove code that absented cron [puppet] - 10https://gerrit.wikimedia.org/r/662036 (https://phabricator.wikimedia.org/T273673) [22:07:27] I guess hashar only went to .29 today? 
[22:07:49] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:07:58] https://lists.wikimedia.org/pipermail/wikitech-l/2021-February/094254.html [22:08:13] https://sal.toolforge.org/log/ntJhiHcBhxWNv8gIhjka has twentyafterfour "prepping 1.36.0-wmf.30" [22:08:13] I propose we deploy to testwiki with .29, but avoid enabling on metawiki, until .30 is live there [22:08:34] hmm [22:08:36] that one remaining patch can be merged on master and then backported to .30, but we don't need to wait for that now [22:08:53] I'm working on backporting patches now [22:09:12] twentyafterfour: when are you planning to rollout .30 to group0? [22:09:39] legoktm: as soon as I can sort out the mess that I'm in with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662985 [22:09:50] ack [22:10:17] DannyS712: so lets enable it on testwiki, but remove the logging portion until that patch is deployed with wmf.30 [22:11:23] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:03] why? 
The logging portion specifies `info`, and the patch that cleaned up the logging moved stuff that were previously `debug` to `info` - if we enable the logging at `info`, without that logging patch the only consequence is that (in theory, if I understand correctly) *nothing* will get logged [22:12:28] oh, you're right [22:12:31] I confused myself [22:12:32] okay [22:12:38] let's go :) [22:13:35] (03PS7) 10Legoktm: Enable GlobalWatchlist extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712) [22:13:44] (03CR) 10Legoktm: [C: 03+2] Enable GlobalWatchlist extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712) [22:14:43] (03Merged) 10jenkins-bot: Enable GlobalWatchlist extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655774 (https://phabricator.wikimedia.org/T260862) (owner: 10DannyS712) [22:14:45] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1298.eqiad.wmnet'] ` an... [22:16:25] DannyS712: try on mwdebug1002 [22:16:32] trying [22:17:03] https://test.wikipedia.org/wiki/Special:GlobalWatchlist [22:17:13] I see it on Special:Version \o/ [22:17:33] its working [22:18:06] (03CR) 10Dzahn: "things looking good on gerrit-prod-1001.devtools in cloud. 
just noticed there is already a timer doing the same thing that comes from the" [puppet] - 10https://gerrit.wikimedia.org/r/662035 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:18:22] (03CR) 10Dzahn: [C: 03+2] gerrit: remove code that absented cron [puppet] - 10https://gerrit.wikimedia.org/r/662036 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:18:38] might be a bug with the sidebar on Special:Watchlist (it should have a link to the special page) but that also looks to be broken on the beta cluster... should be good to sync [22:18:47] (03Restored) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4) [22:19:25] (03PS2) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) [22:20:12] syncing [22:23:02] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GlobalWatchlist extension on testwiki (T260862) (duration: 02m 51s) [22:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:06] T260862: Deploy GlobalWatchlist extension to production (Meta only) - https://phabricator.wikimedia.org/T260862 [22:23:34] 10SRE, 10dev-images, 10docker-pkg, 10Release-Engineering-Team (Local Dev), 10User-brennen: docker-pkg: "certificate verify failed: unable to get local issuer certificate" for docker-registry.discovery.wmnet when publishing dev-images from contint2001 - https://phabricator.wikimedia.org/T274306 (10brennen... 
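Note on the log-level reasoning at 22:12:03: the argument is that a channel enabled at `info` drops anything emitted at `debug`, so without the patch that moved messages from `debug` to `info`, nothing would be recorded. A minimal sketch of that threshold behaviour follows. It uses Python's standard `logging` module purely as a stand-in; the extension's real logging stack is not shown in this log, and the logger name, the `run` helper and the message text are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collects records in memory so the outcome can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}:{record.getMessage()}")


def run(emit_level):
    """Emit one message at emit_level through a logger whose threshold is INFO."""
    # Hypothetical logger name; one logger per level keeps the runs independent.
    logger = logging.getLogger(f"globalwatchlist-sketch-{emit_level}")
    logger.setLevel(logging.INFO)   # the config enables the channel at `info`
    logger.propagate = False        # keep output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    logger.log(emit_level, "watchlist settings saved")
    logger.removeHandler(handler)
    return handler.messages


# Before the cleanup patch: messages are emitted at DEBUG, below the threshold.
before_patch = run(logging.DEBUG)
# After the cleanup patch: the same messages are emitted at INFO and pass.
after_patch = run(logging.INFO)

print(before_patch)
print(after_patch)
```

In this sketch the first list comes back empty and the second holds the single `INFO` record, which matches the conclusion reached in the channel: enabling the channel early is harmless, it just stays silent until the patch lands.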
[22:23:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1298.eqiad.wmnet [22:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:45] legoktm confirmed to be working :) [22:23:55] will figure out the sidebar link in a minute [22:24:02] DannyS712: :)) I think you should send an email to wikitech-l asking for people to test / provide feedback [22:24:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1373.eqiad.wmnet'] ` an... [22:24:36] I will once we're done - if I can figure out the issue with the sidebar link first I'd like to [22:24:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:03] done? was there anything else to deploy? [22:25:30] I mean once I'm done figuring it out - there shouldn't be anything else to deploy now [22:25:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2259.codfw.wmnet'] ` an... [22:25:58] thanks for the help [22:26:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1298.eqiad.wmnet [22:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:15] yw! 
[22:26:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:28:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1373.eqiad.wmnet [22:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1373.eqiad.wmnet [22:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:31:01] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2259.codfw.wmnet [22:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2259.codfw.wmnet [22:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:49] (03PS13) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [22:35:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:37:07] (03PS2) 10Volans: mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [22:40:30] (03CR) 10Dzahn: [V: 04-1] "Failed to execute generator /usr/bin/systemd-analyze: Execution of '/usr/bin/systemd-analyze calendar *-*-1 0:0:00' returned 1: Failed to" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:44:34] (03CR) 10Dzahn: [V: 04-1] "/usr/bin/systemd-analyze calendar '*-*-1 0:0:00'" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:46:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1372.eqiad.wmnet with reason: REIMAGE [22:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1372.eqiad.wmnet with reason: REIMAGE [22:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1296.eqiad.wmnet with reason: REIMAGE [22:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1296.eqiad.wmnet with reason: REIMAGE [22:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:01] (03Abandoned) 1020after4: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 1020after4) [23:01:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:07] (03PS2) 10Dzahn: phabricator: 
convert statistics mail crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) [23:03:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2250.codfw.wmnet with reason: REIMAGE [23:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2250.codfw.wmnet with reason: REIMAGE [23:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:23] (03PS1) 10Dzahn: wmflib: add data type for 'day of the week' in systemd timers/caledar [puppet] - 10https://gerrit.wikimedia.org/r/663051 [23:12:07] (03PS2) 10Dzahn: wmflib: add data type for 'day of the week' in systemd timers/calendar [puppet] - 10https://gerrit.wikimedia.org/r/663051 [23:13:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1372.eqiad.wmnet'] ` an... 
[23:16:59] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1001/27951/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[23:17:10] (PS1) Ryan Kemper: relforge: New hosts are relforge100[3,4] [homer/public] - https://gerrit.wikimedia.org/r/663054 (https://phabricator.wikimedia.org/T274314)
[23:18:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1372.eqiad.wmnet
[23:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:14] Puppet, SRE, puppet-compiler, Patch-For-Review, User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (Ladsgroup) Thanks. I don't know much about puppet so I'm probably wrong but it's just it's mentioned in list of core types: https://puppet.com/do...
[23:19:55] (CR) Kosta Harlan: "Tested this out, and it's working! Still WIP until we can sort out the credential / env variables, see inline comment." (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[23:20:01] (CR) Dzahn: [C: -1] "needs some code to set it to "*" by default if weekday or monthday is not set, will follow-up later" [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[23:21:54] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1380.eqiad.wmnet'] ` Of...
[23:23:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1372.eqiad.wmnet
[23:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:02] !log mw1380 - powercycling after it did not come back from normal reboot during reimaging
[23:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1380.eqiad.wmnet
[23:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:11] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1380.eqiad.wmnet
[23:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:33] (PS1) 20after4: all wikis to 1.36.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/663056
[23:35:35] (CR) 20after4: [C: +2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/663056 (owner: 20after4)
[23:36:29] (Merged) jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/663056 (owner: 20after4)
[23:39:59] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1296.eqiad.wmnet'] ` an...
[23:40:30] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1296.eqiad.wmnet
[23:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:31] (PS2) Cwhite: profile: remove deprecated syslog input [puppet] - https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032)
[23:42:40] (PS1) Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - https://gerrit.wikimedia.org/r/663057
[23:43:20] (PS1) Legoktm: Add hiera for docker_registry_ha I76a6fc9d21380 [labs/private] - https://gerrit.wikimedia.org/r/663058
[23:44:53] (PS2) Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - https://gerrit.wikimedia.org/r/663057
[23:44:58] (PS3) Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - https://gerrit.wikimedia.org/r/663057
[23:45:41] Did everything get rolled back to .27 again?
[23:45:58] yes
[23:46:00] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27
[23:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:08] PROBLEM - Apache HTTP on mw1382 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:46:16] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 3543 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:46:30] looking at 1382
[23:46:50] scap pulling
[23:47:02] Is the intention to roll things forward again shortly, or are we stuck for a while again? (I ask entirely because it affects whether I go ahead with the config patch I have scheduled for the current backport window.)
[23:47:10] !log twentyafterfour@deploy1001 Started scap: (no justification provided)
[23:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:29] Kemayo: stuck until .30
[23:47:34] PROBLEM - Apache HTTP on mw2220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:47:36] (CR) Cwhite: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: Cwhite)
[23:47:36] mutante: I'm running a sync-world
[23:47:45] !log running scap sync-world
[23:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:06] twentyafterfour: Okay, got it.
[23:48:11] Kemayo: I'll update the task here in a second, but we're staying on wmf.27 for now, will move forward once we get a backport for wmf.30 figured out. wmf.29 won't go out.
[23:48:17] .30 tomorrow, hopefully
[23:48:25] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1296.eqiad.wmnet
[23:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:32] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:48:51] twentyafterfour: thanks and that's good
[23:49:07] 1296 repooled as well.. and pulled
[23:49:08] PROBLEM - Apache HTTP on mw1385 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:49:10] PROBLEM - Apache HTTP on mw1383 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 944 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:49:55] ugh, this is: https://phabricator.wikimedia.org/T273334
[23:49:56] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:50:44] !log mw1383,mw1385 - scap pull, php
[23:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:37] RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:51:43] RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:52:29] twentyafterfour: should be good now
[23:52:39] rescheduled icinga after pull etc
[23:52:51] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2250.codfw.wmnet'] ` an...
[23:52:55] (pulling also runs the restart check)
[23:52:59] DannyS712: is GlobalWatchlist is OK on wmf.27 or should we disable it?
[23:53:38] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2250.codfw.wmnet
[23:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:25] !log twentyafterfour@deploy1001 Finished scap: (no justification provided) (duration: 08m 43s)
[23:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:40] !log Depooled `wdqs1005` - it's catching up on hours of lag
[23:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:03] (Restored) Thcipriani: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 20after4)
[23:56:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet
[23:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:07] (PS1) 20after4: testwikis wikis to 1.36.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/663060
[23:57:09] (CR) 20after4: [C: +2] testwikis wikis to 1.36.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/663060 (owner: 20after4)
[23:57:55] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/663060 (owner: 20after4)
[23:58:09] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.30
[23:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:15] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops