[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T0000).
[00:00:04] ejegg and kemayo: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:55] legoktm: I guess it'll be on wmf.30 now. ;-)
[00:00:55] (PS1) Ottomata: Refine - use spark assembly without hadoop jars [puppet] - https://gerrit.wikimedia.org/r/663061 (https://phabricator.wikimedia.org/T273711)
[00:01:19] !log train status: wmf.28 and wmf.29 are undeployed. wmf.27 is everywhere with the exception of testwikis which is at wmf.30 refs T271344
[00:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:24] I turn away for a second and we've jumped 3 weeks
[00:01:24] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344
[00:01:35] twentyafterfour: thank you :)
[00:01:41] (and everyone else pushing the train forward!)
[00:02:20] +1
[00:02:25] legoktm: just playing the role of chaos monkey.
[00:02:34] (CR) jerkins-bot: [V: -1] Refine - use spark assembly without hadoop jars [puppet] - https://gerrit.wikimedia.org/r/663061 (https://phabricator.wikimedia.org/T273711) (owner: Ottomata)
[00:03:00] (PS2) Legoktm: Add hiera for docker_registry_ha I76a6fc9d21380 [labs/private] - https://gerrit.wikimedia.org/r/663058
[00:03:43] (CR) Legoktm: [V: +2 C: +2] Add hiera for docker_registry_ha I76a6fc9d21380 [labs/private] - https://gerrit.wikimedia.org/r/663058 (owner: Legoktm)
[00:03:45] (PS2) Ottomata: Refine - use spark assembly without hadoop jars [puppet] - https://gerrit.wikimedia.org/r/663061 (https://phabricator.wikimedia.org/T273711)
[00:03:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:12] thcipriani: twentyafterfour I can help with the backport of featured feeds
[00:04:18] and review if needed
[00:04:27] this has been going on for too long
[00:04:33] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27952/console" [puppet] - https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[00:06:46] Amir1: that would be welcome for sure
[00:06:56] did we again rollback? :-(
[00:07:15] Should I do it now?
[00:07:18] Ops around?
[00:07:29] Amir1: it's B&C right now, btw
[00:07:33] oooh
[00:07:35] nice timing
[00:07:41] that makes me ask...is someone leading that window?
[00:07:46] there are two config patches
[00:08:30] Train deploys over-ride B&C windows.
[00:08:36] Plus, mine can be skipped -- it was scheduled on the assumption that .29 was going to stay out.
[00:08:53] Hopefully I'll be back with another one for .30 soon. :D
[00:09:28] * James_F grins.
[00:09:30] Hopefully.
[00:09:39] twentyafterfour: Are you planning to push wmf.30 to group0?
[00:10:03] James_F: not planning to go to group0 until blockers/logspammers are resolved?
[00:10:12] I think that will happen tomorrow
[00:10:24] * James_F nods.
[00:10:25] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) changes to add the new backends to scap...
[00:10:28] but I can stick around and do it tonight if everyone is comfortable with it
[00:11:11] I can push the fix
[00:11:15] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FeaturedFeeds/+/662965
[00:11:29] (CR) Ladsgroup: [C: +2] Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 20after4)
[00:11:34] Amir1: go for it?
[00:11:39] (CR) Ottomata: [C: +2] Refine - use spark assembly without hadoop jars [puppet] - https://gerrit.wikimedia.org/r/663061 (https://phabricator.wikimedia.org/T273711) (owner: Ottomata)
[00:13:33] https://versions.toolforge.org/ says not even test wikis are on wmf.30
[00:13:46] Amir1: it's still syncing
[00:13:51] scap is running currently
[00:13:56] aah
[00:13:57] I see
[00:14:08] sync-apaches: 46% (ok: 160; fail: 0; left: 187)
[00:14:12] Yeah, first scap takes forever.
[00:14:19] (PS1) Ottomata: Fix type in refine.pp spark conf [puppet] - https://gerrit.wikimedia.org/r/663062 (https://phabricator.wikimedia.org/T273711)
[00:14:33] We could make it much faster by dropping i18n and doing everything in English. ;-)
[00:14:44] only first scap James_F ?
[00:15:32] Urbanecm: Yeah. Needs to do a total copy of all the new files to each host, plus a full i18n build and sync.
[00:15:40] Up to ~an hour.
[00:16:02] (CR) Ottomata: [C: +2] Fix type in refine.pp spark conf [puppet] - https://gerrit.wikimedia.org/r/663062 (https://phabricator.wikimedia.org/T273711) (owner: Ottomata)
[00:16:05] it shouldn't be that long at this point
[00:16:08] i thought that i18n build happens every time someone runs scap sync-world, but maybe I'm wrong :)
[00:16:31] Urbanecm: Yes, but it's working from the i18n already being present, that's just the small build step.
[00:16:32] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) @Krinkle also see T149924#6699330 and...
[00:16:34] it does but it's cached so it takes longer for a new branch
[00:16:41] got it
[00:16:43] So very much longer.
[00:16:43] (Merged) jenkins-bot: Fix issues with recent caching update [extensions/FeaturedFeeds] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/662965 (https://phabricator.wikimedia.org/T264391) (owner: 20after4)
[00:16:53] #StuffOnlyRelEngHaveToCryAbout
[00:17:11] also, the actual xfer of the bits via rsync on first scap makes it take extra long
[00:17:23] sync-apaches: 92% (ok: 321; fail: 0; left: 26)
[00:17:27] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) >>! In T247653#6814030, @Krinkle wrote:...
[00:17:43] deprecating cdb would have helped?
[00:18:08] it would have simplified the process
[00:18:15] the process itself is weird
[00:18:25] scap-cdb-rebuild: 0% (ok: 0; fail: 0; left: 366)
[00:18:29] (CR) CRusnov: "Thank you for the feedback.
Here are my responses:" (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) (owner: CRusnov)
[00:18:40] we go: cdb -> json -> sync json with servers -> each server rebuilds cdb from json
[00:18:42] Amir1: Yes, but SRE had concerns that it would slow down prod.
[00:19:00] cdb rebuild is going fast, already 33$
[00:19:03] 33%
[00:19:07] getting rid of that weird dance would have been a win
[00:19:10] i wonder if it would be helpful to log a few more of the scap steps in here.
[00:19:15] ...or just noisy.
[00:19:15] Clearly we should have logmsgbot ping the channel with status updates for mega-scaps.
[00:19:17] 50%
[00:19:21] Ha! Snap, brennen. :-D
[00:19:27] :)
[00:19:34] thcipriani: Alas.
[00:42:37] (CR) Legoktm: k8s: Add docker-registry credentials to pull restricted images (3 comments) [puppet] - https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[00:48:53] (PS2) Anne Tomasevich: Add external entity search URI for new MediaSearch extension [mediawiki-config] - https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939)
[00:50:00] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 441512552 and 275 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:21] (CR) jerkins-bot: [V: -1] Add external entity search URI for new MediaSearch extension [mediawiki-config] - https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: Anne Tomasevich)
[00:52:28] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 104112 and 405 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more
units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:33] !log milimetric@deploy1001 Started deploy [analytics/refinery@b539bf6]: Job fixes after Hadoop upgrade
[00:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:50] !log doc1001 - reloaded apache2
[00:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:41] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) There is definitely only doc1001 in the...
[01:06:28] !log milimetric@deploy1001 Finished deploy [analytics/refinery@b539bf6]: Job fixes after Hadoop upgrade (duration: 10m 55s)
[01:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:36] !log milimetric@deploy1001 Started deploy [analytics/refinery@b539bf6] (thin): Job fixes after Hadoop upgrade
[01:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:43] !log milimetric@deploy1001 Finished deploy [analytics/refinery@b539bf6] (thin): Job fixes after Hadoop upgrade (duration: 00m 06s)
[01:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:35] (CR) BryanDavis: "This change is a contributing factor to T274310" [docker-images/production-images] - https://gerrit.wikimedia.org/r/661915 (owner: Giuseppe Lavagetto)
[01:19:04] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) ` Got error 'PHP message: PHP Parse er...
[01:25:10] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) This is a stretch server with PHP 7.0....
[01:27:56] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Krinkle) >>! In T247653#6817226, @Dzahn wrote:...
[01:41:10] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:43:22] !log krinkle@deploy1001 Started deploy [integration/docroot@fddc7c9]: Unbreak doc.wm.o - Ibf28e02ec03
[01:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:29] !log krinkle@deploy1001 Finished deploy [integration/docroot@fddc7c9]: Unbreak doc.wm.o - Ibf28e02ec03 (duration: 00m 06s)
[01:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:03] !log krinkle@deploy1001 Started deploy [integration/docroot@0234db2]: Unbreak doc.wm.o (2) - Ib67da94fb1bdf0
[01:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:09] !log krinkle@deploy1001 Finished deploy [integration/docroot@0234db2]: Unbreak doc.wm.o (2) - Ib67da94fb1bdf0 (duration: 00m 06s)
[01:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:42] SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review, Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) after the changes above and
rebooting th...
[02:43:08] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:00] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:45] (CR) Anne Tomasevich: Add external entity search URI for new MediaSearch extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: Anne Tomasevich)
[03:23:08] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:08] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.208 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:08] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:46:12] !log `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph.service`
[03:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:06] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.064 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:43:04] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:50] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:14] SRE, Commons, Traffic, Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Aawarapam) The traffic spikes are closely matching indian holidays. 2 Oct, 5 sept, 14 Nov, 31 Dec, 12-14 Jan etc.
[05:58:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076 to clone db1162 T258361', diff saved to https://phabricator.wikimedia.org/P14277 and previous config saved to /var/cache/conftool/dbconfig/20210210-055846-marostegui.json
[05:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:52] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:00:05] (PS1) Marostegui: install_server: Reimage db1162 to Stretch [puppet] - https://gerrit.wikimedia.org/r/663103 (https://phabricator.wikimedia.org/T258361)
[06:01:25] (CR) Marostegui: [C: +2] install_server: Reimage db1162 to Stretch [puppet] - https://gerrit.wikimedia.org/r/663103 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:03:06] (PS1) Marostegui: db1170: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/663104 (https://phabricator.wikimedia.org/T258361)
[06:04:54] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1020.eqiad.wmnet
[06:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:34] (CR) Marostegui: [C: +2] db1170: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/663104 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:08:42] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by
marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1162.eqiad.wmnet'] ` The log ca...
[06:11:10] (PS1) Marostegui: instances.yaml: Add db1170 to dbctl [puppet] - https://gerrit.wikimedia.org/r/663105 (https://phabricator.wikimedia.org/T258361)
[06:11:26] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1020.eqiad.wmnet
[06:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:15] (CR) Marostegui: [C: +2] instances.yaml: Add db1170 to dbctl [puppet] - https://gerrit.wikimedia.org/r/663105 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:16:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1170:3312 and db1170:3317 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14278 and previous config saved to /var/cache/conftool/dbconfig/20210210-061638-marostegui.json
[06:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:44] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1170:3312, db1170:3317 with minimal weight for the first time T258361', diff saved to https://phabricator.wikimedia.org/P14279 and previous config saved to /var/cache/conftool/dbconfig/20210210-061924-marostegui.json
[06:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
[06:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
[06:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:49] SRE, DBA,
Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1162.eqiad.wmnet'] ` and were **ALL** successful.
[06:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1170:3312, db1170:3317 T258361', diff saved to https://phabricator.wikimedia.org/P14281 and previous config saved to /var/cache/conftool/dbconfig/20210210-063534-marostegui.json
[06:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:40] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:39:15] (CR) Marostegui: [C: +1] "This looks good: https://puppet-compiler.wmflabs.org/compiler1003/27955/clouddb1013.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: Bstorm)
[06:41:37] (PS1) Marostegui: mariadb: Productionize db1162 [puppet] - https://gerrit.wikimedia.org/r/663107 (https://phabricator.wikimedia.org/T258361)
[06:42:39] (CR) Marostegui: [C: +2] mariadb: Productionize db1162 [puppet] - https://gerrit.wikimedia.org/r/663107 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:43:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool db1170:3312, db1170:3317 T258361', diff saved to https://phabricator.wikimedia.org/P14282 and previous config saved to /var/cache/conftool/dbconfig/20210210-064330-marostegui.json
[06:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:35] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:43:56] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) -
https://phabricator.wikimedia.org/T258361 (Marostegui) Fully pooled: db1170:3312 db1170:3317
[06:44:15] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:45:17] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:45:31] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:35:33] I'm going to be live hacking/debugging on mwdebug1003
[07:42:58] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:02] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127 T266483', diff saved to https://phabricator.wikimedia.org/P14283 and previous config saved to /var/cache/conftool/dbconfig/20210210-080512-marostegui.json
[08:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:18] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[08:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14284 and previous config saved to /var/cache/conftool/dbconfig/20210210-081453-root.json
[08:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:05] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836
[08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:09] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[08:29:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 20%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14285 and previous config saved to /var/cache/conftool/dbconfig/20210210-082957-root.json
[08:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:48] I'm done on mwdebug1003
[08:37:21] (CR) Filippo Giunchedi: "AFAICT removing a lvs service will need to be done in reverse steps as mentioned here: https://wikitech.wikimedia.org/wiki/LVS#Remove_a_lo" [puppet] - https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: Cwhite)
[08:37:56] (PS4) Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521)
[08:37:58] (PS2) Legoktm: k8s: Add
docker-registry credentials to pull restricted images [puppet] - https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521)
[08:38:37] apergos: we are back to 1.36.0-wmf.27 , the upgrade we did yesterday got rolled back :D
[08:39:03] morning hashar and apergos
[08:39:21] morning!
[08:39:22] not sure why thought
[08:39:24] though
[08:39:40] afaik just so there is a safe version to rollback
[08:39:48] the puzzling thing is that yesterday with wmf.29 we only had a few Serialization Closure issues
[08:40:49] but I guess that was the same amount of issues we previously had so
[08:41:16] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1404.eqiad.wmnet
[08:41:18] lets see what happens now that the fix and the fix for the fix are merged
[08:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:37] !log depooling mw1404.eqiad.wmnet for perf benchmarking (T274041)
[08:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:42] T274041: Investigate performance impact of HookContainer loading 500+ interfaces - https://phabricator.wikimedia.org/T274041
[08:43:03] (CR) JMeybohm: [C: -1] k8s: Add docker-registry credentials to pull restricted images (5 comments) [puppet] - https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[08:44:34] SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[08:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 40%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14286 and previous config saved to /var/cache/conftool/dbconfig/20210210-084500-root.json
[08:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:13] hey hashar and Majavah
[08:45:33] when was 29 -> frwiki rolled back?
[08:45:38] er, 30 -> frwiki
[08:46:10] I checked logstash errors this morning and there were indeed no new logged errors after the push to frwiki, to me that is conclusive
[08:46:23] I also checked for errors from Feed* and there were no exceptions
[08:46:30] amir manually patched mwdebug1002 to have frwiki to .30, otherwise it is only on group0
[08:47:09] .29 does not have the fixes
[08:47:28] oh so 30 is still on frwiki
[08:47:35] on mwdebug1002, yes
[08:47:46] oh, only on mwdebug1002?
[08:47:50] yes
[08:47:55] I played around with featuredfeeds on frwiki mwdebug1002 today morning after my exam was done, not sure if you checked logstash before or after that
[08:48:04] what time utc?
[08:48:22] maybe around 8 utc
[08:49:22] (PS3) Matthias Mullie: Add external entity search URI for new MediaSearch extension [mediawiki-config] - https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: Anne Tomasevich)
[08:49:24] mwdebug1002 doesn't actually get any real traffic aside from basic monitoring and devs
[08:50:09] yeah so that's no good, I will comment on the task and retract my opinion, it was based on 30 -> frwiki everywhere
[08:50:18] clearly a mistaken understanding
[08:50:51] (PS4) Matthias Mullie: Add external entity search URI for new MediaSearch extension [mediawiki-config] - https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: Anne Tomasevich)
[08:54:20] ok well. I'll go check logstash again right now and see if there's anything new
[08:54:32] from mwdebug1002 right? I can at least filter for that host
[08:56:03] (PS1) Elukey: cumin: add presto test alias [puppet] - https://gerrit.wikimedia.org/r/663150
[08:56:36] apergos: yes
[08:57:22] welp, I still see nothing that looks like errors...
[08:57:45] sounds promising
[08:57:50] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1162 is now replicating, but I won't pool it until I'm back next week.
[08:58:53] https://logstash.wikimedia.org/goto/61b2bb002beab0f1fdb5dceec68b3502
[08:58:54] (CR) Elukey: [C: +2] cumin: add presto test alias [puppet] - https://gerrit.wikimedia.org/r/663150 (owner: Elukey)
[08:58:57] for those with access.
[09:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 60%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14287 and previous config saved to /var/cache/conftool/dbconfig/20210210-090004-root.json
[09:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:15] (PS1) Elukey: sre.hadoop: add presto in test cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/663151
[09:00:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 10%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14288 and previous config saved to /var/cache/conftool/dbconfig/20210210-090057-root.json
[09:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:12] so, hashar, Majavah, what is next to move this forward?
[09:01:25] I don't have logstash access so can't look at that link you just sent
[09:02:04] (CR) JMeybohm: [C: -1] [WIP] linkrecommendation: Cron job to load datasets (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[09:02:23] Majavah: I thought that might be the case. the only log messages I see reported are for around 1:40 a.m.
(utc) and they are saying that deferredupdates started and ended, no errors
[09:02:32] again from mwdebug1002
[09:03:56] (CR) Elukey: [C: +2] sre.hadoop: add presto in test cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/663151 (owner: Elukey)
[09:04:53] is there a way of knowing what was inside the deferred update? ie do we know that deferred update did come from featuredfeeds caching?
[09:05:09] frwiki was switched to .30 on mwdebug1002 about 00:40 utc
[09:06:44] (Merged) jenkins-bot: sre.hadoop: add presto in test cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/663151 (owner: Elukey)
[09:07:40] there were a few asyncrefreshes but none of them from feeds
[09:10:12] (CR) JMeybohm: [C: +1] docker_registry_ha: Properly override nginx timeouts [puppet] - https://gerrit.wikimedia.org/r/662806 (owner: Legoktm)
[09:10:57] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[09:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[09:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 80%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14289 and previous config saved to /var/cache/conftool/dbconfig/20210210-091507-root.json
[09:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 25%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14290 and previous config saved to /var/cache/conftool/dbconfig/20210210-091601-root.json
[09:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] (CR) JMeybohm: [C: -1] docker_registry_ha: Have restricted/ images that are limited read/write (3 comments) [puppet] - https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm)
[09:23:15] !log rolling restart of cp nodes to catch up on kernel upgrades
[09:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14292 and previous config saved to /var/cache/conftool/dbconfig/20210210-093011-root.json
[09:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:55] SRE, Internet-Archive: noc.wikimedia.org disappeared - https://phabricator.wikimedia.org/T274342 (Gilles)
[09:31:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 50%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14293 and previous config saved to /var/cache/conftool/dbconfig/20210210-093104-root.json
[09:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14294 and previous config saved to /var/cache/conftool/dbconfig/20210210-093132-root.json
[09:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1075.eqiad.wmnet
[09:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:14] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1076.eqiad.wmnet
[09:33:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2027.codfw.wmnet [09:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:50] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2028.codfw.wmnet [09:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:38] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3050.esams.wmnet [09:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3051.esams.wmnet [09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:30] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4027.ulsfo.wmnet [09:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:56] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4021.ulsfo.wmnet [09:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5007.eqsin.wmnet [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5001.eqsin.wmnet [09:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:25] (03PS1) 10Gilles: Don’t apply X-Wikimedia-Debug routing to noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/663156 (https://phabricator.wikimedia.org/T245552) [09:45:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1076.eqiad.wmnet [09:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:00] !log 
vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1075.eqiad.wmnet [09:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 75%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14295 and previous config saved to /var/cache/conftool/dbconfig/20210210-094608-root.json [09:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2028.codfw.wmnet [09:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 20%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14296 and previous config saved to /var/cache/conftool/dbconfig/20210210-094635-root.json [09:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2027.codfw.wmnet [09:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3050.esams.wmnet [09:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3051.esams.wmnet [09:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:27] (03CR) 10Kormat: [C: 03+2] tox: Add py3 env that uses default system python3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 (owner: 10Kormat) [09:50:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cp4027.ulsfo.wmnet [09:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:22] (03Merged) 10jenkins-bot: tox: Add py3 env that uses default system python3 [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662966 (owner: 10Kormat) [09:56:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5001.eqsin.wmnet [09:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:05] 10SRE, 10Internet-Archive: noc.wikimedia.org is a 404 when X-Wikimedia-Debug is enabled - https://phabricator.wikimedia.org/T274342 (10Aklapper) [09:57:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5007.eqsin.wmnet [09:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:02] PROBLEM - Host cp4021 is DOWN: PING CRITICAL - Packet loss = 100% [10:00:07] !log power cycling cp4021 [10:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Slowly repooling db1076 after cloning db1162', diff saved to https://phabricator.wikimedia.org/P14297 and previous config saved to /var/cache/conftool/dbconfig/20210210-100111-root.json [10:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 40%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14298 and previous config saved to /var/cache/conftool/dbconfig/20210210-100139-root.json [10:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:39] (03PS1) 10Elukey: Rename the cdh module to bigtop [puppet] - 10https://gerrit.wikimedia.org/r/663160 [10:02:59] (03PS1) 10Marostegui: mariadb: Decommission db1081 [puppet] - 10https://gerrit.wikimedia.org/r/663161 
(https://phabricator.wikimedia.org/T273040) [10:03:38] RECOVERY - Host cp4021 is UP: PING OK - Packet loss = 0%, RTA = 68.36 ms [10:05:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:08] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4021.ulsfo.wmnet [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1081 [puppet] - 10https://gerrit.wikimedia.org/r/663161 (https://phabricator.wikimedia.org/T273040) (owner: 10Marostegui) [10:14:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:50] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Aklapper) [10:15:50] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:58] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (10Marostegui) a:05Marostegui→03wiki_willy [10:16:02] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (10Marostegui) [10:16:24] !log installing firejail security updates [10:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:38] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14299 and previous config saved to /var/cache/conftool/dbconfig/20210210-101642-root.json [10:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:18] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Clsuter for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) [10:18:51] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) @Ladsgroup This could well be to do with how puppetlabs defines core type however it has definitely been removed from the puppet git repo... 
[10:20:30] (03Abandoned) 10Elukey: Rename the cdh module to bigtop [puppet] - 10https://gerrit.wikimedia.org/r/663160 (owner: 10Elukey) [10:25:09] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1077.eqiad.wmnet [10:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1078.eqiad.wmnet [10:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2029.codfw.wmnet [10:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2030.codfw.wmnet [10:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3052.esams.wmnet [10:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3053.esams.wmnet [10:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Remove old OpenStack Rocky files/templates/manifests [puppet] - 10https://gerrit.wikimedia.org/r/663027 (owner: 10Andrew Bogott) [10:27:50] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4028.ulsfo.wmnet [10:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4022.ulsfo.wmnet [10:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5008.eqsin.wmnet [10:28:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5002.eqsin.wmnet [10:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 80%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14300 and previous config saved to /var/cache/conftool/dbconfig/20210210-103146-root.json [10:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:06] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1077.eqiad.wmnet [10:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1078.eqiad.wmnet [10:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2029.codfw.wmnet [10:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3053.esams.wmnet [10:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2030.codfw.wmnet [10:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:49] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3052.esams.wmnet [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4022.ulsfo.wmnet [10:40:14] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5002.eqsin.wmnet [10:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4028.ulsfo.wmnet [10:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:34] !log powercycle cp5008 [10:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Slowly repool db1127', diff saved to https://phabricator.wikimedia.org/P14301 and previous config saved to /var/cache/conftool/dbconfig/20210210-104649-root.json [10:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:12] (03PS3) 10Kormat: mysql_root_clients: Allow orch access to clouddb [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) [10:54:16] (03CR) 10Kormat: "@bstorm: Can i get you to look at this, please? 
I don't want to merge it without a sanity-check :)" [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) (owner: 10Kormat) [11:00:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5008.eqsin.wmnet [11:00:51] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:07] (03PS1) 10Muehlenhoff: Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/663172 [11:09:59] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for daniram [puppet] - 10https://gerrit.wikimedia.org/r/663172 (owner: 10Muehlenhoff) [11:12:49] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4023.ulsfo.wmnet [11:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:31] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:43] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1404.eqiad.wmnet [11:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:35] (03PS1) 10Ayounsi: Improve loopback dhcp term [homer/public] - 10https://gerrit.wikimedia.org/r/663176 [11:28:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4023.ulsfo.wmnet [11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:18] (03CR) 10Giuseppe Lavagetto: Add support for php deployments (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [11:32:51] (03PS6) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [11:38:18] (03CR) 10Arturo Borrero Gonzalez: "LGTM, comment inline" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/663176 (owner: 10Ayounsi) [11:42:28] (03CR) 10Ayounsi: Improve loopback dhcp term (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/663176 (owner: 10Ayounsi) [11:42:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1079.eqiad.wmnet [11:42:47] vgutierrez@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[11:42:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1080.eqiad.wmnet [11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:24] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2031.codfw.wmnet [11:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:41] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2032.codfw.wmnet [11:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:50] (03CR) 10Arturo Borrero Gonzalez: "This cr/firewall.conf file is only loaded in core routers, right?" [homer/public] - 10https://gerrit.wikimedia.org/r/663176 (owner: 10Ayounsi) [11:43:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3054.esams.wmnet [11:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3055.esams.wmnet [11:44:14] (03PS2) 10Muehlenhoff: Initial client profile for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/662945 [11:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:38] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4029.ulsfo.wmnet [11:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5009.eqsin.wmnet [11:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5003.eqsin.wmnet [11:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:25] (03CR) 10Arturo Borrero Gonzalez: Improve loopback dhcp term (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/663176 
(owner: 10Ayounsi) [11:47:22] (03PS1) 10Volans: wmf-auto-reimage: splay the start when in parallel [puppet] - 10https://gerrit.wikimedia.org/r/663178 [11:47:55] (03CR) 10JMeybohm: [C: 04-1] Add support for php deployments (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [11:54:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2031.codfw.wmnet [11:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2032.codfw.wmnet [11:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1079.eqiad.wmnet [11:54:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3054.esams.wmnet [11:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1080.eqiad.wmnet [11:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:12] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5009.eqsin.wmnet [11:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3055.esams.wmnet [11:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:58] !log powercycle cp5003 [11:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:18] 10% of crash ratio on rebooting cp hosts /o\ [11:58:46] (03CR) 
10Jbond: [C: 04-1] "LGTM but we should have it under the systemd module e.g." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663051 (owner: 10Dzahn) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T1200). [12:00:05] Urbanecm, dcausse, and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] o/ [12:00:50] I can deploy today [12:01:02] beware of the horoscope [12:01:19] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4029.ulsfo.wmnet [12:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:29] PROBLEM - dhclient process on sretest1002 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:01:49] tabbycat: is there sth affecting deployers? :-) [12:02:06] (03PS2) 10Urbanecm: Set wgGEHelpPanelAskMentor to true for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661448 (https://phabricator.wikimedia.org/T272753) [12:02:11] (03CR) 10Urbanecm: [C: 03+2] Set wgGEHelpPanelAskMentor to true for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661448 (https://phabricator.wikimedia.org/T272753) (owner: 10Urbanecm) [12:03:36] Lucas_WMDE: can I just +2 your patch, or do you want to self-deploy? Sounds simple enough to me. 
[12:03:36] (03Merged) 10jenkins-bot: Set wgGEHelpPanelAskMentor to true for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661448 (https://phabricator.wikimedia.org/T272753) (owner: 10Urbanecm) [12:04:32] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) It looks congestion-dependent? It peaks around ~22 UTC and improves at ~6 UTC: https://grafana-... [12:04:52] I’d like to deploy it :) [12:05:01] dcausse: around? [12:05:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5003.eqsin.wmnet [12:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:15] Lucas_WMDE: okay, I'll ping you once done [12:05:33] (03PS3) 10Urbanecm: Enable GrowthExperiments on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650012 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [12:05:37] (03CR) 10Urbanecm: [C: 03+2] Enable GrowthExperiments on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650012 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [12:06:36] (03Merged) 10jenkins-bot: Enable GrowthExperiments on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650012 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [12:06:38] (I'll sneak in this patch as well, forgot to schedule it) [12:06:58] (03CR) 10Elukey: [C: 03+1] "LGTM as workaround, I am wondering if more than one minute could be better, like 2/3, but we can always review it later!" 
[puppet] - 10https://gerrit.wikimedia.org/r/663178 (owner: 10Volans) [12:07:08] (03PS1) 10Marostegui: install_server: Do not reimage db1157 [puppet] - 10https://gerrit.wikimedia.org/r/663181 (https://phabricator.wikimedia.org/T258361) [12:07:20] hasharLunch: Majavah: I checked wikimedia-versions.json on mwdebug1001,2,3 and .30 is deployed only to testwiki, testwikidatawiki, labtestwiki on all. not deployed to frwiki [12:07:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2d8cb10f246904f1af07b019da270fd8dc7816fa: Set wgGEHelpPanelAskMentor to true for several wikis (T272753) (duration: 01m 21s) [12:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:34] T272753: Scale: pilot help panel with mentorship in frwiki, bnwiki, arwiki, viwiki - https://phabricator.wikimedia.org/T272753 [12:07:56] so that is why you would have seen no log messages, unless it has since been undeployed from frwiki on mwdebug1002 [12:08:03] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1157 [puppet] - 10https://gerrit.wikimedia.org/r/663181 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [12:08:09] it was definitely on frwiki when testing [12:08:23] moritzm: ok to merge your change? 
[12:09:33] maybe it was reset during this backport window [12:09:54] Amir's comments indicate that he just livehacked it there and it would reset on next scap pull [12:10:12] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/663178 (owner: 10Volans) [12:10:15] that sounds plausible [12:10:32] apergos: if he did so, he had to edit wikiversions.php, not the json file, btw [12:10:44] (03PS2) 10Volans: wmf-auto-reimage: splay the start when in parallel [puppet] - 10https://gerrit.wikimedia.org/r/663178 [12:11:53] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e8214ee812f3812f609c26d6422b85a99a91e1f6: Enable GrowthExperiments on bnwiki (T266020) (duration: 01m 08s) [12:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:59] T266020: Deploy Growth experiments at Bangla Wikipedia - https://phabricator.wikimedia.org/T266020 [12:12:01] Lucas_WMDE: the floor is yours [12:12:03] (03CR) 10Vgutierrez: [C: 03+1] Don’t apply X-Wikimedia-Debug routing to noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/663156 (https://phabricator.wikimedia.org/T245552) (owner: 10Gilles) [12:12:14] I have just done the following: tried to get v1 keys from the WAN object cache for testwiki featured feeds en, tried to get v2 keys, all empty. Loaded https://test.wikipedia.org/wiki/Special:FeedItem/featured/20210201000000/en via mwdebug1002, got a whine because there's no article (ok); re-fetched the v2 key, it's now there [12:12:29] ok thanks! 
[12:12:42] I can now look at the logs for specific errors [12:12:50] (03PS2) 10Lucas Werkmeister (WMDE): Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870) (owner: 10Rosalie Perside (WMDE)) [12:12:55] I'm not testing multilingual feeds of course :-( [12:12:59] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870) (owner: 10Rosalie Perside (WMDE)) [12:13:26] do we have those anywhere on group0? [12:14:13] frwiki on 1002 still seems to be on .30 fwiw [12:14:40] someone definitely ran scap pull there already (either manually or by all-box script) [12:14:43] (03Merged) 10jenkins-bot: Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870) (owner: 10Rosalie Perside (WMDE)) [12:15:07] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Version MediaWiki 1.36.0-wmf.30 (afb9c32) 2021-02-09T04:07:27 [12:15:21] what's that timestamp? last commit? 
[12:15:21] I’m about to pull to mwdebug1001, hope that’s okay [12:15:28] sorry I'm late [12:15:36] testing on mwdebug1001 [12:15:45] Lucas_WMDE: definitely, we're just talking how to debug :) [12:16:22] (03PS10) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [12:16:45] test seems fine, syncing [12:18:21] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:658321|Remove Wikibase.NewItemIdFormatter log channel (T268870)]] 1/2 (duration: 01m 07s) [12:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:27] T268870: Remove Wikibase.NewItemIdFormatter log channel - https://phabricator.wikimedia.org/T268870 [12:19:37] (03PS1) 10Muehlenhoff: profile::kerberos::keytabs: Drop require for the user [puppet] - 10https://gerrit.wikimedia.org/r/663184 [12:20:03] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:658321|Remove Wikibase.NewItemIdFormatter log channel (T268870)]] 2/2 (prod no-op) (duration: 01m 08s) [12:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:09] (the New Yorker would, of course, call it a prod noöp) [12:20:33] dcausse: do you want to self-deploy your change? [12:20:38] Lucas_WMDE: sure [12:20:42] alright, go ahead :) [12:20:46] I'm still looking at the debug logs, there are 4k+ lines :-) [12:20:46] thanks! :) [12:20:54] for my one request... [12:20:58] apergos: yeah, verbose logging is...really verbose :) [12:21:05] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:11] (03PS4) 10DCausse: [wdqs] Add flink sideoutput stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) [12:21:28] (03CR) 10DCausse: [C: 03+2] [wdqs] Add flink sideoutput stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [12:22:01] (03CR) 10Elukey: [C: 03+1] wmf-auto-reimage: splay the start when in parallel [puppet] - 10https://gerrit.wikimedia.org/r/663178 (owner: 10Volans) [12:22:24] (03Merged) 10jenkins-bot: [wdqs] Add flink sideoutput stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [12:23:32] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: splay the start when in parallel [puppet] - 10https://gerrit.wikimedia.org/r/663178 (owner: 10Volans) [12:24:26] (03CR) 10Elukey: [C: 03+1] profile::kerberos::keytabs: Drop require for the user [puppet] - 10https://gerrit.wikimedia.org/r/663184 (owner: 10Muehlenhoff) [12:24:51] PROBLEM - configured eth on sretest1002 is CRITICAL: eno2 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:26:44] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T269619: [wdqs] Add flink sideoutput stream definitions (duration: 01m 06s) [12:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:52] T269619: Create pipelines for late/spurious/failed events - https://phabricator.wikimedia.org/T269619 [12:27:04] (03PS2) 10Jbond: base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) [12:28:07] I'm done [12:28:29] so we're all done then :) [12:29:24] (03CR) 10jerkins-bot: [V: 04-1] base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [12:29:33] (03PS1) 10Thiemo Kreuz (WMDE): [DNM] ReferenceTooltips gadget names for ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) [12:29:57] (03PS6) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [12:30:57] (03PS11) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [12:31:38] (03CR) 10jerkins-bot: [V: 04-1] WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [12:35:42] (trying to make frwiki cache the key it ought to cache, via mwdebug1002 now) [12:36:17] mwdebug1002 frwiki 1.36.0-wmf.27 so that won't happen [12:36:28] ok back to looking at my logs from testwiki [12:40:03] (03PS12) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 
(https://phabricator.wikimedia.org/T274139) [12:41:54] (03PS13) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [12:42:34] (03PS2) 10Ayounsi: Improve loopback DHCP term [homer/public] - 10https://gerrit.wikimedia.org/r/663176 [12:45:06] (03PS14) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [12:45:46] I scried everything as best as possible, still see no errors, key is cached properly, but again this is not multilingual anything. [12:45:52] not sure what to do next [12:46:05] roll out to mediawiki on debug1002 and check that? [12:46:06] (03PS4) 10Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) [12:46:46] (this question is for hasharAway and Majavah) [12:47:12] who's the train conductor this week? [12:47:19] it can wait for hasharAway to return, presuming he's back later today [12:47:20] (03PS5) 10Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) [12:47:48] imo that's a question for releng and not me [12:47:56] um twentyafterfour I believe [12:48:03] them and hashar according to wikitech [12:48:07] all right, duly pinged :-) [12:48:14] but I'm fairly confident about the fix [12:48:19] i will add my meagre findings to the task in the meantime [12:48:32] thanks! [12:48:43] PROBLEM - configured eth on sretest1001 is CRITICAL: ens2f1 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:48:44] according to -releng they will be back later today [12:49:05] PROBLEM - dhclient process on sretest1001 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:51:16] (03PS7) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [12:54:14] (03PS15) 10Arturo Borrero Gonzalez: toolforge: add ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [12:54:32] ok great, task updated and we shall see [12:56:33] (03CR) 10Ayounsi: relforge: New hosts are relforge100[3,4] (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/663054 (https://phabricator.wikimedia.org/T274314) (owner: 10Ryan Kemper) [12:57:19] we should probably add a multilingual feed to a group0 wiki for future incidents [12:58:11] that's probably true [12:58:30] the extension ought to have tests too, at some point [13:09:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "The filter itself LGTM, but I think the original warning was about the port being opened in cloudsw devices, not in the core routers. 
Is t" [homer/public] - 10https://gerrit.wikimedia.org/r/663176 (owner: 10Ayounsi) [13:13:41] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:53] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) [13:18:51] (03PS3) 10Jbond: base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 [13:19:24] (03CR) 10Jbond: "PCC (still running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27960" [puppet] - 10https://gerrit.wikimedia.org/r/661917 (owner: 10Jbond) [13:20:04] (03PS4) 10Jbond: base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 [13:21:20] (03CR) 10Jbond: "correct pcc" [puppet] - 10https://gerrit.wikimedia.org/r/661917 (owner: 10Jbond) [13:21:43] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:21] (03CR) 10jerkins-bot: [V: 04-1] base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (owner: 10Jbond) [13:24:36] (03PS5) 10Jbond: base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 [13:25:42] (03CR) 10MSantos: "`cleanup-old-osm2pgsql-tables.sql` should be removed. 
The reason for it was if we migrated imposm3 without cleaning-up data, which is not " (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [13:26:44] (03CR) 10MSantos: [C: 03+1] role::maps: fix MOTD message [puppet] - 10https://gerrit.wikimedia.org/r/662659 (owner: 10Hnowlan) [13:27:04] (03CR) 10jerkins-bot: [V: 04-1] base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (owner: 10Jbond) [13:33:51] (03PS6) 10Jbond: base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) [13:38:09] (03PS1) 10Kormat: integration: Allow testing of multiple versions [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663192 (https://phabricator.wikimedia.org/T265266) [13:38:18] (03CR) 10Volans: [C: 03+1] "replies inline." (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [13:39:32] (03PS2) 10Kormat: integration: Allow testing of multiple versions [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663192 (https://phabricator.wikimedia.org/T265266) [13:40:47] (03CR) 10Effie Mouzeli: [C: 03+2] Don’t apply X-Wikimedia-Debug routing to noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/663156 (https://phabricator.wikimedia.org/T245552) (owner: 10Gilles) [13:46:12] (03PS3) 10Kormat: integration: Allow testing of multiple versions [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663192 (https://phabricator.wikimedia.org/T265266) [13:47:31] (03PS8) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [13:50:25] (03CR) 10Kormat: [C: 03+2] integration: Allow testing of multiple versions 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663192 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [13:52:55] (03CR) 10Jcrespo: "Actually change looks fine as a starting point, only thing missing is the denylist on monitoring screens: https://puppet-compiler.wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:54:22] (03Merged) 10jenkins-bot: integration: Allow testing of multiple versions [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663192 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [14:00:04] twentyafterfour and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T1400). [14:03:03] (03PS9) 10Jcrespo: dbbackups: Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [14:05:05] hey if any of you are here (hashar, twentyafterfour) please see my question about the ubn [14:05:08] in the scrollback [14:11:14] (03PS1) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [14:11:53] (03PS2) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [14:14:19] (03CR) 10Effie Mouzeli: [C: 04-2] "Not good https://puppet-compiler.wmflabs.org/compiler1002/27964/" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [14:24:12] (03PS1) 10David Caro: utils: add script to run docker ci tests locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) [14:26:40] (03CR) 10Ammarpad: [C: 03+1] Add inline documentation to 
configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [14:29:01] (03PS1) 10Jbond: P:idp: drop tls config in cloud [puppet] - 10https://gerrit.wikimedia.org/r/663206 [14:29:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27967/console" [puppet] - 10https://gerrit.wikimedia.org/r/663206 (owner: 10Jbond) [14:30:40] (03CR) 10jerkins-bot: [V: 04-1] P:idp: drop tls config in cloud [puppet] - 10https://gerrit.wikimedia.org/r/663206 (owner: 10Jbond) [14:32:33] (03CR) 10David Caro: [C: 04-1] "Now that we have development docs I'll add it there too :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [14:33:51] (03CR) 10Jbond: WIP: profile::memcached::instance: remove "default_values" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [14:37:07] (03PS2) 10Jbond: P:idp: drop tls config in cloud [puppet] - 10https://gerrit.wikimedia.org/r/663206 [14:39:01] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1081.eqiad.wmnet [14:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1082.eqiad.wmnet [14:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2034.codfw.wmnet [14:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:46] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3056.esams.wmnet [14:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:03] (03CR) 10Jbond: [C: 03+2] P:idp: drop tls config in cloud [puppet] - 
10https://gerrit.wikimedia.org/r/663206 (owner: 10Jbond) [14:40:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3057.esams.wmnet [14:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:58] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4030.ulsfo.wmnet [14:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4024.ulsfo.wmnet [14:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5010.eqsin.wmnet [14:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5004.eqsin.wmnet [14:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:01] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for urbanecm - https://phabricator.wikimedia.org/T274318 (10Ottomata) Approved. 
[14:45:38] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2033.codfw.wmnet [14:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:05] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2034.codfw.wmnet [14:50:05] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5004.eqsin.wmnet [14:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] !log updating puppet-compiler-facts [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3056.esams.wmnet [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4030.ulsfo.wmnet [14:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5010.eqsin.wmnet [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1081.eqiad.wmnet [14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1082.eqiad.wmnet [14:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:57] apergos: I am half there yeah [14:53:30] ok, uh, what do you think (I guess we are not using this train deploy window) [14:53:55] hasharAway: [14:54:09] we can do group0 I guess 
[14:54:11] (03CR) 10Elukey: "Added a couple of notes, let me know!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [14:54:20] when do you want to do it [14:54:29] I gotta leave in 40 minutes [14:54:33] I want to run a test first to get the version 1 of the key [14:54:41] ok let me do this test right now [14:55:06] but we can at least deploy the wmf.30 patch for FeatureFeeds if it has not been deployed already [14:55:10] and promote group0 wikis [14:55:21] not like those wikis have FeatureFeeds [14:56:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3057.esams.wmnet [14:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4024.ulsfo.wmnet [14:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:56] ok I need to do a preliminary test so I can get the format of the key in question [14:58:01] verify that the v2 version ain't there [14:58:10] (03PS1) 10KartikMistry: Update cxserver to 2021-02-10-134029-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/663213 (https://phabricator.wikimedia.org/T274133) [14:58:14] then we can promote and I can rerun the test to verify the key IS there w/o errors. 
[14:58:19] so give me 2 to 5 mins [14:58:24] :]] [14:59:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2033.codfw.wmnet [14:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:48] (03PS1) 10Cwhite: hiera: prepare logstash syslog lvs config for removal [puppet] - 10https://gerrit.wikimedia.org/r/663214 (https://phabricator.wikimedia.org/T217032) [15:02:37] well I do not get the format of the key from there [15:02:57] so next is: go ahead and promote, I try the url, I verify that the exception I see is not in today's log again [15:03:19] and I can check that a key goes in because it will be v2 and in the debug logs and I'll see "miss" for the first round [15:03:26] then i can retrieve it to verify there is content [15:03:31] so, tl;dr: roll 'em [15:03:36] oh [15:03:41] so you managed to reproduce it? [15:03:41] (03CR) 10Muehlenhoff: [C: 03+2] profile::kerberos::keytabs: Drop require for the user [puppet] - 10https://gerrit.wikimedia.org/r/663184 (owner: 10Muehlenhoff) [15:03:50] no, this is the multilingual feed issue [15:04:00] ah [15:04:00] (03CR) 10Klausman: Add etcd role for ML Team's new clusters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:04:07] so promote group 0 right? [15:04:12] command is ready to launch [15:04:13] that's mediawikiwiki, right? 
[15:04:24] yes [15:04:27] go [15:04:37] we had www.mediawiki.org added there cause it is "low" traffic [15:04:48] but has a lot of advanced users who can craft nice reports [15:04:53] promoting [15:05:01] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663216 [15:05:03] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663216 (owner: 10Hashar) [15:05:27] !log group0 wikis to 1.36.0-wmf.30 T271344 [15:05:27] well this is extremely fortunate because it has a multilingual feed right on the home page [15:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:32] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344 [15:05:35] ;]]] [15:05:37] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) [15:05:51] is it out already? or I should wait? [15:06:05] that will report back with a !log once completed [15:06:07] ok [15:06:13] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663216 (owner: 10Hashar) [15:06:24] and mwdebug1002 will have it as well, yes? 
because debug logging gives me key names [15:06:33] yeah [15:06:36] perfect [15:06:49] mwdebug hosts are just like all the others [15:06:53] (03CR) 10Itamar Givon: [C: 03+1] "LGTM 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662970 (https://phabricator.wikimedia.org/T272242) (owner: 10Lucas Werkmeister (WMDE)) [15:06:54] it is just that they don't receive live traffic [15:07:02] unless one sets some http header [15:07:12] but otherwise they are part of the scap targets [15:07:15] that browser extension is the best thing ever [15:07:34] yeah I was cautious because of last night's "30 is on such and such wiki only there" [15:08:08] tick... tick... tick... [15:08:13] apaches syncing [15:08:38] I should have just scap pull to mwdebug1002 directly :-D oh well [15:08:58] yeah that works as well [15:09:07] actually that is what we should have done bah [15:09:27] shoulda coulda woulda [15:09:36] I have high confidence in the patch but still [15:11:39] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.30 [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:59] woo hoo [15:12:01] https://www.mediawiki.org/wiki/Special:Version is at wmf.30 [15:12:05] time to test [15:12:12] hope you didn't load the home page yet :-P [15:12:24] most probably someone did [15:14:04] the particular error does not show up so that's a win [15:14:35] nothing in the exception log [15:14:46] no key format, guess I should do that for completeness [15:15:22] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Clsuter for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10diego) [15:15:50] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/27970/" [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:17:05] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: prepare logstash syslog lvs config for removal 
[puppet] - 10https://gerrit.wikimedia.org/r/663214 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [15:18:03] meh it doesn't even have the featuredfeeds [15:18:04] so. [15:20:07] (03CR) 10Filippo Giunchedi: [C: 03+1] base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:20:46] calling it done regardless :-/ [15:21:25] (03PS1) 10Ottomata: Do not produce canary events for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663219 (https://phabricator.wikimedia.org/T269619) [15:21:56] (03CR) 10DCausse: [C: 03+1] Do not produce canary events for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663219 (https://phabricator.wikimedia.org/T269619) (owner: 10Ottomata) [15:22:02] apergos: I thought about using wmf.30 on some wiki on mwdebug [15:22:17] but I am afraid of the side effects it might have if the rest of the app servers are on wmf.27 [15:22:25] hashar: ok if i deploy a config change? ^^ [15:22:31] I already tested testwiki on mwdebug1002 [15:22:41] so that's the equivalent, I noted it on the task earlier [15:22:59] it all looked fine as to the cache and the key etc [15:22:59] ottomata: yes go for it ;) [15:23:02] ty [15:23:07] ottomata: well not like I have any idea what that change is doing hehe [15:23:13] but consider scap your ! 
[15:23:17] (03CR) 10Ottomata: [C: 03+2] Do not produce canary events for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663219 (https://phabricator.wikimedia.org/T269619) (owner: 10Ottomata) [15:24:40] (03Merged) 10jenkins-bot: Do not produce canary events for rdf-streaming-updater streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663219 (https://phabricator.wikimedia.org/T269619) (owner: 10Ottomata) [15:24:40] i guess this means the train can roll forward in the evening slot today [15:25:47] apergos: may you report the result on the wmf.30 blocking task please? [15:26:05] yeah I was getting there until I got pinged about my issues with element -> slack [15:26:09] e_toomanybrokenthings [15:26:10] and I guess american folks will promote wmf.30 to group1 at 20:00 UTC later today [15:26:15] yep [15:26:15] ahah [15:26:17] sounds familiar [15:26:25] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Do not produce canary events for rdf-streaming-updater streams - T269619 (duration: 01m 13s) [15:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:29] T269619: Create pipelines for late/spurious/failed events - https://phabricator.wikimedia.org/T269619 [15:26:30] Error: Too many errors [15:29:25] (03PS1) 10Jcrespo: dbbackups: Update password locations for database-backups db [labs/private] - 10https://gerrit.wikimedia.org/r/663221 (https://phabricator.wikimedia.org/T138562) [15:31:04] commented on task, that should be enough [15:31:13] (03CR) 10Muehlenhoff: [C: 03+1] "Good riddance!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:31:21] thanks a lot folks [15:31:39] (03PS32) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:31:45] happy to help [15:32:32] (03CR) 10Volans: "Started doing a pass, but then converted it into a first pass giving the general comments I have. Sorry in advance for the mix of nits/gen" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [15:33:34] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:33:55] (03PS2) 10Jcrespo: dbbackups: Update password locations for database-backups db [labs/private] - 10https://gerrit.wikimedia.org/r/663221 (https://phabricator.wikimedia.org/T138562) [15:33:57] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27971/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:37:36] (03PS3) 10Jcrespo: dbbackups: Update password locations for database-backups db [labs/private] - 10https://gerrit.wikimedia.org/r/663221 (https://phabricator.wikimedia.org/T138562) [15:37:49] (03CR) 10Jbond: "thanks see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:42:47] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10hashar) I can not tell 
why the homepage of doc.... [15:43:01] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:05] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbbackups: Update password locations for database-backups db [labs/private] - 10https://gerrit.wikimedia.org/r/663221 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:50:09] (03CR) 10Jcrespo: "alert hosts look good now, but this change requires private puppet changes deployed at the same time https://puppet-compiler.wmflabs.org/c" [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:50:45] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:57] (03CR) 10Muehlenhoff: [C: 03+2] Initial client profile for unprivileged Cumin [puppet] - 10https://gerrit.wikimedia.org/r/662945 (owner: 10Muehlenhoff) [15:57:08] (03PS8) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [15:58:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1081.eqiad.wmnet - https://phabricator.wikimedia.org/T273040 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks @Marostegui >>! In T273040#6817937, @Marostegui wrote: > This is ready for DCOps! 
[16:01:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still [16:01:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database, still [16:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:00] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#6819019, @hashar wrote: >... [16:02:08] (03Abandoned) 10JMeybohm: Lint the chart _scaffold by creating a dummy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/638025 (owner: 10JMeybohm) [16:02:52] (03PS10) 10Jcrespo: dbbackups: Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:03:26] (03PS1) 10Jbond: P:idp: add ability to disable start tls [puppet] - 10https://gerrit.wikimedia.org/r/663231 [16:03:35] (03CR) 10Elukey: Add etcd role for ML Team's new clusters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:04:04] (03PS33) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [16:04:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27979/console" [puppet] - 10https://gerrit.wikimedia.org/r/663231 (owner: 10Jbond) [16:04:51] RECOVERY - tilerator on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.032 second 
response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:04:53] RECOVERY - tileratorui on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:05:11] (03PS1) 10Volans: dhcpd: create and include files for option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) [16:05:13] (03PS1) 10Volans: dhcpd: move sretest1002 to option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) [16:05:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: add ability to disable start tls [puppet] - 10https://gerrit.wikimedia.org/r/663231 (owner: 10Jbond) [16:05:53] (03CR) 10jerkins-bot: [V: 04-1] dhcpd: create and include files for option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:05:58] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [16:06:06] (03CR) 10jerkins-bot: [V: 04-1] dhcpd: move sretest1002 to option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:06:17] (03CR) 10RLazarus: [C: 03+1] mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [16:07:09] (03PS2) 10Volans: dhcpd: create and include files for option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) [16:07:11] (03PS2) 10Volans: dhcpd: move sretest1002 to option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) [16:08:11] (03CR) 10jerkins-bot: [V: 04-1] dhcpd: create and include files for option 82 [puppet] - 
10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:08:13] (03CR) 10jerkins-bot: [V: 04-1] dhcpd: move sretest1002 to option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:08:28] note to self, don't try to make a puppet patch while doing 3 other things [16:09:39] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27980/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [16:12:34] !log installing atftp security updates [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:10] (03PS1) 10Filippo Giunchedi: alertmanager: route Performance team alerts [puppet] - 10https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) [16:18:12] 10ops-eqiad: maps1005.eqiad.wmnet: possible cable issues - https://phabricator.wikimedia.org/T274387 (10hnowlan) [16:20:33] (03PS1) 10Jbond: P:idp: macke ldap_bind_nd a parameter [puppet] - 10https://gerrit.wikimedia.org/r/663239 [16:20:34] !log installing unzip security updates [16:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:48] (03CR) 10jerkins-bot: [V: 04-1] P:idp: macke ldap_bind_nd a parameter [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:23:31] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[16:23:47] (03CR) 10Muehlenhoff: P:idp: macke ldap_bind_nd a parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:24:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:24:38] (03PS9) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [16:24:45] (03PS1) 10Elukey: cdh::hive: remove sentry specific bits [puppet] - 10https://gerrit.wikimedia.org/r/663240 (https://phabricator.wikimedia.org/T274345) [16:25:40] (03CR) 10Elukey: [C: 03+2] cdh::hive: remove sentry specific bits [puppet] - 10https://gerrit.wikimedia.org/r/663240 (https://phabricator.wikimedia.org/T274345) (owner: 10Elukey) [16:26:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[16:28:58] (03CR) 10Volans: [C: 03+2] mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [16:31:24] 10SRE, 10Services, 10Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) [16:32:53] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) [16:33:45] (03PS1) 10Elukey: druid: remove cdh specific configurations [puppet] - 10https://gerrit.wikimedia.org/r/663241 (https://phabricator.wikimedia.org/T274345) [16:33:49] (03PS2) 10Jbond: P:idp: macke ldap_bind_nd a parameter [puppet] - 10https://gerrit.wikimedia.org/r/663239 [16:34:08] (03CR) 10jerkins-bot: [V: 04-1] P:idp: macke ldap_bind_nd a parameter [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:34:32] (03PS3) 10Jbond: P:idp: make ldap_bind_nd a parameter [puppet] - 10https://gerrit.wikimedia.org/r/663239 [16:34:46] (03CR) 10jerkins-bot: [V: 04-1] P:idp: make ldap_bind_nd a parameter [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:35:57] (03Merged) 10jenkins-bot: mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [16:37:14] (03CR) 10Effie Mouzeli: [C: 04-2] "Still not good https://puppet-compiler.wmflabs.org/compiler1002/27985/" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [16:38:13] (03PS3) 10Volans: dhcpd: create and include files for option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) [16:38:15] (03PS3) 10Volans: dhcpd: move sretest1002 to option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) [16:38:31] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 
10Developer Productivity, 10Performance-Team (Radar): noc.wikimedia.org with X-Wikimedia-Debug routes to mwdebug but host is not served there - https://phabricator.wikimedia.org/T245552 (10Gilles) 05Open→03Resolved a:03Gilles Thanks @jijiki ! [16:38:42] (03PS4) 10Jbond: P:idp: update profile to use ldap['proxyagent'] [puppet] - 10https://gerrit.wikimedia.org/r/663239 [16:38:46] (03CR) 10Jbond: P:idp: update profile to use ldap['proxyagent'] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:39:34] (03CR) 10Volans: "Compiler results at: https://puppet-compiler.wmflabs.org/compiler1003/27983/install1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:39:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27986/console" [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:40:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:40:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1371.eqiad.wmnet with reason: REIMAGE [16:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: update profile to use ldap['proxyagent'] [puppet] - 10https://gerrit.wikimedia.org/r/663239 (owner: 10Jbond) [16:41:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:42:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1371.eqiad.wmnet with reason: REIMAGE [16:42:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1379.eqiad.wmnet with reason: REIMAGE [16:42:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:44:14] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10MSantos) [16:44:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1379.eqiad.wmnet with reason: REIMAGE [16:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:07] (03CR) 10Cwhite: [C: 03+2] hiera: prepare logstash syslog lvs config for removal [puppet] - 10https://gerrit.wikimedia.org/r/663214 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [16:47:34] (03CR) 10Ayounsi: [C: 03+1] dhcpd: create and include files for option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:47:56] (03CR) 10Ayounsi: [C: 03+1] dhcpd: move sretest1002 to option 82 [puppet] - 10https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [16:48:13] (03PS10) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [16:52:04] (03PS3) 10Dzahn: systemd: add data type for 'day of the week' in systemd timers/calendar [puppet] - 10https://gerrit.wikimedia.org/r/663051 [16:53:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1295.eqiad.wmnet with reason: REIMAGE [16:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:47] (03CR) 10Dzahn: "could releng let us know if this is officially declined or still happening? 
It was stalled month ago by request but the reason for that re" [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [16:56:54] (03PS1) 10Cwhite: hiera: prepare logstash-syslog lvs service for removal [puppet] - 10https://gerrit.wikimedia.org/r/663242 (https://phabricator.wikimedia.org/T217032) [16:57:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1295.eqiad.wmnet with reason: REIMAGE [16:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:45] (03CR) 10Phuedx: [C: 03+1] Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [16:59:19] (03CR) 10Dzahn: "how to get review from traffic?" [puppet] - 10https://gerrit.wikimedia.org/r/659377 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:06:29] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/663242 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [17:07:23] (03CR) 10Cwhite: [C: 03+2] hiera: prepare logstash-syslog lvs service for removal [puppet] - 10https://gerrit.wikimedia.org/r/663242 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [17:07:38] (03PS2) 10Cwhite: hiera: prepare logstash-syslog lvs service for removal [puppet] - 10https://gerrit.wikimedia.org/r/663242 (https://phabricator.wikimedia.org/T217032) [17:08:36] apergos: which question in the scrollback? 
[17:08:53] (03PS34) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:08:56] ah now it's asked and answered :-) [17:09:10] !log andrew@deploy1001 Started deploy [horizon/deploy@4f5a5a7]: puppet dashboard policy updates [17:09:10] ah ok [17:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:13] current state of things for the ubn train blocker: it is deployed to group1 seemingly without issues [17:09:17] (03CR) 10Hnowlan: [V: 03+1] "I believe this latest patchset addresses all issues raised, apologies for the sprawling nature of fixes." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:09:22] i.e. wmf.30 is deployed [17:09:47] apergos: oh, excellent [17:10:14] (03PS5) 10Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) [17:10:16] (03PS3) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) [17:10:25] no, wrong, it is on group0 [17:10:26] (03CR) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [17:10:29] including mediawikiwiki [17:10:32] (03CR) 10Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [17:10:33] (sorry) [17:10:47] I just checked exception log, it seems fine still [17:11:50] so "we" were thinking that in the evening train slot the deployer might roll it out to group1 [17:12:16] 
indeed that would be the plan [17:12:20] cool! [17:13:02] !log andrew@deploy1001 Finished deploy [horizon/deploy@4f5a5a7]: puppet dashboard policy updates (duration: 03m 53s) [17:13:04] !log restart pybal on backup lvs1016 [17:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:01] (03PS11) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [17:14:22] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.36:10514]) https://wikitech.wikimedia.org/wiki/PyBal [17:14:58] ^^ known [17:15:13] (03CR) 10Elukey: [C: 03+2] druid: remove cdh specific configurations [puppet] - 10https://gerrit.wikimedia.org/r/663241 (https://phabricator.wikimedia.org/T274345) (owner: 10Elukey) [17:18:21] (03PS12) 10Effie Mouzeli: WIP: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [17:18:58] !log restart pybal on low-traffic lvs1015 [17:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:57] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1001.eqiad.wmnet [17:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:26] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.36:10514]) https://wikitech.wikimedia.org/wiki/PyBal [17:21:42] (03PS1) 10Andrew Bogott: cinder: set default policy to admin_or_projectadmin [puppet] - 10https://gerrit.wikimedia.org/r/663251 (https://phabricator.wikimedia.org/T274107) [17:22:26] (03CR) 10Effie Mouzeli: "woohoo https://puppet-compiler.wmflabs.org/compiler1003/27991/" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [17:22:48] 
(03PS13) 10Effie Mouzeli: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [17:23:38] (03PS14) 10Effie Mouzeli: profile::memcached::instance: remove "default_values" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [17:23:52] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:24:26] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:25:07] (03CR) 10JMeybohm: k8s: Add docker-registry credentials to pull restricted images (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [17:27:49] (03PS1) 10Andrew Bogott: Openstack policies: add a default policy override requiring projectadmin [puppet] - 10https://gerrit.wikimedia.org/r/663253 (https://phabricator.wikimedia.org/T274107) [17:28:05] (03PS3) 10Cwhite: profile: remove deprecated syslog input [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) [17:28:11] (03CR) 10Andrew Bogott: [C: 03+2] cinder: set default policy to admin_or_projectadmin [puppet] - 10https://gerrit.wikimedia.org/r/663251 (https://phabricator.wikimedia.org/T274107) (owner: 10Andrew Bogott) [17:28:42] (03CR) 10Andrew Bogott: [C: 03+2] Openstack policies: add a default policy override requiring projectadmin [puppet] - 10https://gerrit.wikimedia.org/r/663253 (https://phabricator.wikimedia.org/T274107) (owner: 10Andrew Bogott) [17:38:52] (03CR) 10Cwhite: [C: 03+2] profile: update w3creportingapi to use 12 weekly indexes [puppet] - 10https://gerrit.wikimedia.org/r/661993 (https://phabricator.wikimedia.org/T274005) (owner: 10Cwhite) [17:39:59] (03CR) 10Bstorm: [C: 03+2] wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill [puppet] - 
10https://gerrit.wikimedia.org/r/662797 (https://phabricator.wikimedia.org/T274044) (owner: 10Bstorm) [17:40:14] PROBLEM - Host thumbor1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:41:13] (03CR) 10JMeybohm: [C: 04-1] docker_registry_ha: Have restricted/ images that are limited read/write (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [17:41:33] 10SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10Dzahn) [17:42:24] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1295.eqiad.wmnet'] ` an... [17:42:54] (03PS2) 10Dzahn: hieradata/common: replace hiera within hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662021 (https://phabricator.wikimedia.org/T209953) [17:43:10] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1295.eqiad.wmnet [17:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:13] (03PS1) 10Andrew Bogott: Keystone policy: standardize on the rule name 'admin_or_projectadmin' [puppet] - 10https://gerrit.wikimedia.org/r/663258 (https://phabricator.wikimedia.org/T274107) [17:45:04] (03CR) 10Andrew Bogott: [C: 03+2] Keystone policy: standardize on the rule name 'admin_or_projectadmin' [puppet] - 10https://gerrit.wikimedia.org/r/663258 (https://phabricator.wikimedia.org/T274107) (owner: 10Andrew Bogott) [17:47:56] RECOVERY - Check systemd state on clouddb1013 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:36] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1295.eqiad.wmnet [17:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:53:31] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663261 [17:54:12] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [17:55:18] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [17:58:05] (03PS1) 10Jdlrobson: Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) [17:58:08] RECOVERY - Check systemd state on clouddb1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:34] RECOVERY - Check systemd state on clouddb1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:38] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:34] (03PS2) 10Jdlrobson: Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) [18:01:48] RECOVERY - Check systemd state on clouddb1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:35] yay ^ , bstorm! 
(I am guessing it was you) [18:04:51] Yes it was :) [18:04:57] thank you!!!! [18:05:06] np [18:05:33] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1371.eqiad.wmnet'] ` Of... [18:06:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1379.eqiad.wmnet'] ` Of... [18:13:24] (03CR) 10Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:13:40] (03CR) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:14:10] (03PS6) 10Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) [18:14:12] (03PS4) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) [18:14:31] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host thumbor1001.eqiad.wmnet [18:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:22] (03CR) 10jerkins-bot: [V: 04-1] k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) 
[18:17:22] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [18:17:24] (03PS5) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) [18:17:26] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [18:17:34] RECOVERY - Check systemd state on clouddb1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1294.eqiad.wmnet with reason: REIMAGE [18:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:59] (03CR) 10Dzahn: [C: 03+1] "not speaking for the nginx config, but the lookup/hiera part and naming of the keys looks good to me now." [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:20:09] 10SRE, 10Scap, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)): Re-imaged mw app servers can end up with missing l10n cache for old versions of MW needed for rollback - https://phabricator.wikimedia.org/T273334 (10Legoktm) This happened agai... 
[18:20:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1294.eqiad.wmnet with reason: REIMAGE [18:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:04] RECOVERY - Check systemd state on clouddb1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:07] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... [18:25:49] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Legoktm) [18:27:02] RECOVERY - Host logstash1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:27:51] (03PS1) 10Legoktm: Update docker-registry related hiera keys for I76a6fc9d21 and Ic655290a69a [labs/private] - 10https://gerrit.wikimedia.org/r/663268 [18:28:03] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Update docker-registry related hiera keys for I76a6fc9d21 and Ic655290a69a [labs/private] - 10https://gerrit.wikimedia.org/r/663268 (owner: 10Legoktm) [18:28:31] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663269 [18:28:35] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663270 [18:29:11] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27992/console" [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:30:42] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27993/console" [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:31:38] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27994/console" [puppet] - 10https://gerrit.wikimedia.org/r/662806 (owner: 10Legoktm) [18:31:47] (03PS3) 10Legoktm: docker_registry_ha: Properly override nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/662806 [18:32:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1371.eqiad.wmnet [18:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:29] !log andrew@deploy1001 Started deploy [horizon/deploy@4f5a5a7]: security group dashboard policy updates [18:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:36] !log andrew@deploy1001 Finished deploy [horizon/deploy@4f5a5a7]: security group dashboard policy updates (duration: 00m 07s) [18:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:21] !log andrew@deploy1001 Started deploy [horizon/deploy@02cb8a4]: security group dashboard policy updates, now after doing a submodule update! 
[18:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:54] (03CR) 10Legoktm: [C: 03+2] docker_registry_ha: Properly override nginx timeouts [puppet] - 10https://gerrit.wikimedia.org/r/662806 (owner: 10Legoktm) [18:34:10] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1379.eqiad.wmnet [18:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:31] (03CR) 10Dzahn: "+1 for the lookup/hiera keys part discussed on IRC, not speaking for the rest of it" [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:36:52] !log andrew@deploy1001 Finished deploy [horizon/deploy@02cb8a4]: security group dashboard policy updates, now after doing a submodule update! (duration: 03m 31s) [18:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:00] (03PS1) 10Jbond: idp - cloud: update base-dn [puppet] - 10https://gerrit.wikimedia.org/r/663274 [18:43:22] 10SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10Dzahn) [18:44:08] (03CR) 10Jbond: [C: 03+2] idp - cloud: update base-dn [puppet] - 10https://gerrit.wikimedia.org/r/663274 (owner: 10Jbond) [18:44:36] PROBLEM - Host mw1379 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:45] 10SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10Dzahn) [18:46:48] (03CR) 10Bstorm: [C: 03+1] "Looks good to me. Access peculiarities are kept to the proxy layer, so I have no concerns at this layer." [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) (owner: 10Kormat) [18:47:10] legoktm: any idea why hosts are going down like 1379 that were already done? [18:47:21] uhh not me [18:47:30] I'm doing 1321-1324 [18:47:31] hmm, ACK.. looking [18:47:35] ok, thx [18:47:48] weird.. 
everything else seemed normal and it was done [18:48:08] and it's 2 of them [18:48:58] oh.. it's what happened once before yesterday [18:49:11] everything works .. past the first puppet run [18:49:22] but then it does the final reboot and times out [18:49:44] then I powercycled and it was .. fine [18:52:11] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1321.eqiad.wmnet with reason: REIMAGE [18:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:54] Unable to perform requested operation. [18:54:08] hrmm.. another special case [18:54:09] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1322.eqiad.wmnet with reason: REIMAGE [18:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:21] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1321.eqiad.wmnet with reason: REIMAGE [18:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:13] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1323.eqiad.wmnet with reason: REIMAGE [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:31] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1322.eqiad.wmnet with reason: REIMAGE [18:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:28] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1371.eqiad.wmnet [18:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:58:07] twentyafterfour: the train is good to move forward, the subticket is addressed T273242#6818961 [18:58:08] T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242 [18:58:08] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1324.eqiad.wmnet with reason: REIMAGE [18:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:30] Amir1: thanks! [18:58:36] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1323.eqiad.wmnet with reason: REIMAGE [18:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour and hashar: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T1900). [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:05] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1379.eqiad.wmnet [19:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:39] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1324.eqiad.wmnet with reason: REIMAGE [19:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:09] !log mw1379 - racadm racreset - host did not come back from reboot and DRAC says it can't powercycle it.. 
while it also ALREADY ON [19:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:06:38] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1294.eqiad.wmnet'] ` an... [19:07:02] (03PS1) 10Jbond: P:idp::client: add generic uwsgi template [puppet] - 10https://gerrit.wikimedia.org/r/663276 [19:07:04] (03PS2) 10Thcipriani: Remove a couple of useless DNS lookups from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) (owner: 10Giuseppe Lavagetto) [19:08:03] (03CR) 10Thcipriani: [C: 03+2] Remove a couple of useless DNS lookups from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) (owner: 10Giuseppe Lavagetto) [19:08:12] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:08:55] (03Merged) 10jenkins-bot: Remove a couple of useless DNS lookups from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) (owner: 10Giuseppe Lavagetto) [19:11:52] RECOVERY - Host mw1379 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:12:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1294.eqiad.wmnet [19:12:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:16] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1370.eqiad.wmnet with reason: REIMAGE [19:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1379.eqiad.wmnet [19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:25] 10SRE: mw1379 - down after reboot attempt and DRAC can't powercycle - https://phabricator.wikimedia.org/T274403 (10Dzahn) [19:16:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1370.eqiad.wmnet with reason: REIMAGE [19:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:10] 10SRE: mw1379 - down after reboot attempt and DRAC can't powercycle - https://phabricator.wikimedia.org/T274403 (10Dzahn) [19:17:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [19:17:25] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1379.eqiad.wmnet [19:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1294.eqiad.wmnet [19:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:05] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1294.eqiad.wmnet [19:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:55] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 
10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:20:36] (03PS1) 10Jbond: P:idp: add ability to change memcached port [puppet] - 10https://gerrit.wikimedia.org/r/663279 [19:20:41] !log thcipriani@deploy1001 Synchronized wmf-config/ProductionServices.php: [[gerrit:661732|Remove a couple of useless DNS lookups from mediawiki-config]] T231025 (duration: 01m 10s) [19:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:46] T231025: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 [19:23:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1378.eqiad.wmnet with reason: REIMAGE [19:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27995/console" [puppet] - 10https://gerrit.wikimedia.org/r/663279 (owner: 10Jbond) [19:23:35] (03CR) 10Jbond: [C: 03+2] P:idp::client: add generic uwsgi template [puppet] - 10https://gerrit.wikimedia.org/r/663276 (owner: 10Jbond) [19:23:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: add ability to change memcached port [puppet] - 10https://gerrit.wikimedia.org/r/663279 (owner: 10Jbond) [19:25:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1378.eqiad.wmnet with reason: REIMAGE [19:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:45] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:05] ACKNOWLEDGEMENT - configured eth on sretest1001 is CRITICAL: ens2f1 reporting no carrier. daniel_zahn machines with TEST in the name should not have prod monitoring alerts https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:27:05] ACKNOWLEDGEMENT - dhclient process on sretest1001 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient daniel_zahn machines with TEST in the name should not have prod monitoring alerts https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:27:05] ACKNOWLEDGEMENT - configured eth on sretest1002 is CRITICAL: eno2 reporting no carrier. daniel_zahn machines with TEST in the name should not have prod monitoring alerts https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:27:05] ACKNOWLEDGEMENT - dhclient process on sretest1002 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient daniel_zahn machines with TEST in the name should not have prod monitoring alerts https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:29:52] (03CR) 10MarcoAurelio: "Untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/663074 (owner: 10MarcoAurelio) [19:30:09] 10SRE: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10Dzahn) [19:30:50] (03CR) 10Dzahn: "tried to compile on all - opened https://phabricator.wikimedia.org/T274392 because there are always a bunch of false positives" [puppet] - 10https://gerrit.wikimedia.org/r/662021 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:31:12] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27989/" [puppet] - 10https://gerrit.wikimedia.org/r/662021 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:33:27] 10SRE: mw1379 - down after reboot attempt and DRAC can't powercycle - https://phabricator.wikimedia.org/T274403 (10Dzahn) 05Open→03Resolved a:03Dzahn [19:33:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [19:34:20] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) a:03Robert-RtC3V [19:35:17] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% [19:36:44] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10RhinosF1) @Krinkle: Robert doesn't seem to have been active for a few years nor involved in this task. Did you mean to assign it to him? [19:37:52] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) a:05Robert-RtC3V→03None It wasn't me, it was . 
[19:38:14] (03CR) 10Dzahn: [C: 03+2] wmcs::monitoring: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662026 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:38:51] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10RhinosF1) >>! In T245183#6819999, @Krinkle wrote: > It wasn't me, it was . Can we get some... [19:40:19] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Ladsgroup) I'm admin and I can't disable it.... [19:41:46] (03CR) 10Dzahn: "cloudmetrics1002 - confirmed noop" [puppet] - 10https://gerrit.wikimedia.org/r/662026 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:41:59] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10RhinosF1) >>! In T245183#6820029, @Ladsgroup wrote: > I'm admin and I can't disable it.... Phabricator is good at that. I guess it'll have to be... [19:45:52] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Dzahn) >>! In T245183#6820029, @Ladsgroup wrote: > I'm admin and I can't disable it.... You should ask the members of the "phabricator-admin" sh... 
[19:46:22] (03CR) 10Dzahn: [C: 03+2] ldap: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/661916 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [19:46:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1293.eqiad.wmnet with reason: REIMAGE [19:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:39] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:02] (03CR) 10Dzahn: "being bold here.. Ladsgroup can't self-merge this and T209953 was originally created by wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/661916 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [19:49:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1293.eqiad.wmnet with reason: REIMAGE [19:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:56] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) This would be done if it wasn't for a single remaining case: Could you guys fix this one please? ` 62 puppetmaster::servers: 63 "%{hiera('puppetmas... [19:52:30] (03PS1) 10Dzahn: cloud: replace hiera in hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) [19:52:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1321.eqiad.wmnet', 'mw13... 
[19:53:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:36] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1321.eqiad.wmnet [19:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:40] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1322.eqiad.wmnet [19:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:44] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1323.eqiad.wmnet [19:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:50] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1324.eqiad.wmnet [19:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:19] (03PS14) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [19:59:40] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@1c5477d]: query_clicks: timestamp is now a reserved keyword [19:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] twentyafterfour and hashar: Your horoscope predicts another unfortunate Mediawiki train - American+European Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T2000). 
[20:00:06] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1321.eqiad.wmnet [20:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:23] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1322.eqiad.wmnet [20:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:06] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1323.eqiad.wmnet [20:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:29] 10SRE, 10Maps: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10Dzahn) 05Open→03Stalled [20:01:59] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@1c5477d]: query_clicks: timestamp is now a reserved keyword (duration: 02m 19s) [20:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:56] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1324.eqiad.wmnet [20:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:13] (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663295 [20:09:15] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663295 (owner: 1020after4) [20:09:57] (03CR) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [20:10:04] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663295 (owner: 1020after4) [20:12:27] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.30 [20:12:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:30] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.30 (duration: 01m 02s) [20:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:36] (03PS19) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [20:19:38] (03CR) 10CRusnov: dhcp: Introduce automation proxies for management networks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [20:20:16] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [20:21:33] !log mw1370, mw1378 - again failing to reboot as the last step of reimaging script [20:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:08] !log mw1370, mw1378 - powercycling via DRAC [20:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:55] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1370.eqiad.wmnet'] ` an... [20:27:45] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1378.eqiad.wmnet'] ` an... 
[20:31:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1370.eqiad.wmnet [20:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1378.eqiad.wmnet [20:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:13] (03PS15) 10Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [20:35:01] (03CR) 10Kosta Harlan: "Tested with local-charts, it's working" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [20:36:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1378.eqiad.wmnet [20:36:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1293.eqiad.wmnet'] ` an... 
[20:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1370.eqiad.wmnet [20:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1293.eqiad.wmnet [20:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1293.eqiad.wmnet [20:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:50] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) >>! In T273741#6816099, @Joe wrote: >>>! In T273741#6815874, @Majavah wrote: >> Is the effect that the block will hav... [20:45:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:46:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:47:12] (03PS16) 10Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [20:53:14] 10SRE, 10netops, 10observability: Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (10herron) Sorry, I should have clarified this initially, afaict a proxy won't work for this case because logstash configures this at the JVM level and would have unwanted effects on the ot... [21:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210210T2100). [21:01:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1369.eqiad.wmnet with reason: REIMAGE [21:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1377.eqiad.wmnet with reason: REIMAGE [21:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:59] (03PS1) 10Ebernhardson: airflow: Increase scheduler health check to match interval [puppet] - 10https://gerrit.wikimedia.org/r/663304 [21:03:38] (03CR) 10Volans: "Some comments/questions inline." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [21:04:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1369.eqiad.wmnet with reason: REIMAGE [21:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:23] (03CR) 10Cwhite: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [21:06:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1377.eqiad.wmnet with reason: REIMAGE [21:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... [21:10:17] (03PS11) 10Cwhite: profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) [21:12:45] (03PS12) 10Cwhite: profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) [21:14:21] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [21:15:24] (03CR) 10Cwhite: [C: 03+2] profile: update netdev to output ECS-formatted logs [puppet] - 10https://gerrit.wikimedia.org/r/647029 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:19:56] (03CR) 10Ebernhardson: "verified that changing this config var fixes the UI warning in our analytics-integration environment." 
[puppet] - 10https://gerrit.wikimedia.org/r/663304 (owner: 10Ebernhardson) [21:21:08] (03PS1) 10Jgreen: remove A/PTR records for frdata1001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/663307 (https://phabricator.wikimedia.org/T255435) [21:22:52] (03CR) 10Jgreen: [C: 03+2] remove A/PTR records for frdata1001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/663307 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [21:35:29] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1325.eqiad.wmnet with reason: REIMAGE [21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:28] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1326.eqiad.wmnet with reason: REIMAGE [21:37:39] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1325.eqiad.wmnet with reason: REIMAGE [21:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:32] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1327.eqiad.wmnet with reason: REIMAGE [21:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:39] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1326.eqiad.wmnet with reason: REIMAGE [21:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:29] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1328.eqiad.wmnet with reason: REIMAGE [21:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:44] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1327.eqiad.wmnet with reason: REIMAGE [21:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:51] !log 
legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1328.eqiad.wmnet with reason: REIMAGE [21:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:20] twentyafterfour: did wmf.30 roll to group1 yet? I'm just trying to keep track [21:52:18] (03CR) 10Dave Pifke: [C: 03+1] "Overall LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) (owner: 10Filippo Giunchedi) [21:52:45] nm I see it is, I should have checked versions earlier, my bad [21:55:36] ryankemper: kibana is failing on relforge1003/1004 [21:56:29] (03PS1) 10Legoktm: Revert "profiler: Send data to excimer-buster pipeline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663078 [21:56:45] (03PS1) 10Legoktm: Revert "arclamp: Add excimer-buster pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/663079 [21:56:55] (03PS2) 10Legoktm: Revert "arclamp: Add excimer-buster pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/663079 [21:56:57] (03PS2) 10Legoktm: Revert "profiler: Send data to excimer-buster pipeline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663078 [21:57:07] ACKNOWLEDGEMENT - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T262211#6817218 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:07] ACKNOWLEDGEMENT - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T262211#6817218 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:25] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10thcipriani) 05Open→03Resolved a:03Joe Specific er... [21:58:07] (03CR) 10Dave Pifke: [C: 03+1] Revert "profiler: Send data to excimer-buster pipeline" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663078 (owner: 10Legoktm) [22:04:42] (03PS3) 10Cwhite: profile: update netdev rsyslog template to ecs 1.7.0 [puppet] - 10https://gerrit.wikimedia.org/r/647032 (https://phabricator.wikimedia.org/T234565) [22:07:16] !log mw1369, mw1377 - all servers in this section now consistently fail to reboot when triggered as the last step of wmf-reimage script [22:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:35] mutante: is it a general problem or specific to those servers? [22:09:59] legoktm: I don't know, either it is this type of hardware or something broke about sending the reboot command [22:10:08] (03CR) 10Dave Pifke: [C: 03+1] "This can be deployed after the other patch to stop sending data to it." [puppet] - 10https://gerrit.wikimedia.org/r/663079 (owner: 10Legoktm) [22:10:16] the facts I have.. it did not happen until yesterday [22:10:21] now it happens all the time ..to me [22:10:45] but this is a specific section in the etherpad [22:10:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1369.eqiad.wmnet'] ` an... 
[22:10:59] generally if you just powercycle them they will be ok [22:11:02] the 4 I did earlier had no issue, and I'm doing 4 right now, but they've all been in the same group (mw1321-1328) [22:11:12] except the special case among special cases which needed DRAC reset [22:11:15] to be able to do just that [22:11:31] ok, *nod* [22:12:18] if you manually powercycle before a full hour is over.. you can even get the reimage script to end with exit 0 and all good [22:12:39] but if you don't.. just gives up after 60 min [22:12:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1377.eqiad.wmnet'] ` an... [22:13:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1369.eqiad.wmnet [22:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1377.eqiad.wmnet [22:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:34] (03CR) 10Cwhite: [C: 03+2] profile: update netdev rsyslog template to ecs 1.7.0 [puppet] - 10https://gerrit.wikimedia.org/r/647032 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:23:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1369.eqiad.wmnet [22:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:52] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1377.eqiad.wmnet [22:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:35] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27996/" [puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) 
[22:28:07] (03CR) 10Dzahn: [V: 03+1] "already compiled on * - so unless this is used by internal cloud VPS machines - it is noop" [puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:29:11] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10RobH) [22:29:25] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10RobH) [22:29:39] (03CR) 10Dzahn: "This is just asking to add new and already existing VMs to scap so that when people deploy they also deploy to these." [puppet] - 10https://gerrit.wikimedia.org/r/650306 (owner: 10Dzahn) [22:30:50] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d97f7d9]: query_clicks: Remove result file merging [22:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:07] * Krinkle testing on mwdebug1001 [22:32:17] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d97f7d9]: query_clicks: Remove result file merging (duration: 01m 27s) [22:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:13] (03CR) 10Dzahn: "This is switching doc.wm.org from doc1001 (stretch) to doc1002 (buster). It is already up and running, has the right puppet role, no error" [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [22:35:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1325.eqiad.wmnet', 'mw13... 
[22:37:39] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1325.eqiad.wmnet [22:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:46] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1326.eqiad.wmnet [22:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:51] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1327.eqiad.wmnet [22:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:55] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1328.eqiad.wmnet [22:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:59] (PS1) Cwhite: Revert "profile: update netdev rsyslog template to ecs 1.7.0" [puppet] - https://gerrit.wikimedia.org/r/663080 [23:04:54] (CR) Cwhite: [C: +2] Revert "profile: update netdev rsyslog template to ecs 1.7.0" [puppet] - https://gerrit.wikimedia.org/r/663080 (owner: Cwhite) [23:20:21] (PS1) BryanDavis: README: line wrapping for easier source reading [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/663322 [23:38:58] !log milimetric@deploy1001 Started deploy [analytics/refinery@3da19b6]: More fixes for jobs after cluster upgrade [23:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:09] SRE, Analytics, SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (kzimmerman) @Vgutierrez (I saw you were listed on [[ https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty | Clinic Duty ]]) - I ran into access problems again today; do you need a...
[23:49:35] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1325.eqiad.wmnet [23:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:40] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1326.eqiad.wmnet [23:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:45] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1327.eqiad.wmnet [23:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:51] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1328.eqiad.wmnet [23:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:07] SRE: mw1379 - down after reboot attempt and DRAC can't powercycle - https://phabricator.wikimedia.org/T274403 (Papaul) I looked at 3 hosts wmf-auto-reimage .out log, there were no indication of this issue then i looked at the IDRAC log of 3 of the hosts that are having this issue (mw1377,mw1378 and mw1379)... [23:53:21] !log milimetric@deploy1001 Finished deploy [analytics/refinery@3da19b6]: More fixes for jobs after cluster upgrade (duration: 14m 23s) [23:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:33] mutante: fyi, the most notable part of doc1001 is the /srv/doc which is stateful (not scap deployed) [23:59:00] this is not automatically synced from old to new server and between the two new servers, right?