[00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210219T0000).
[00:00:05] dpifke and Jdlrobson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:26] i can deploy today
[00:00:26] I can deploy
[00:00:30] or RoanKattouw
[00:00:30] here
[00:00:33] Oh you beat me to it haha
[00:00:35] I can deploy too. :)
[00:00:50] and both exactly at 00:00 - amazing
[00:01:12] Should be a quick one, setting up to do it now.
[00:01:12] Both exactly at 00:00:26 even
[00:01:17] hehe
[00:01:18] Go ahead dpifke
[00:01:27] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/664920 @RoanKattouw I was hoping to backport that before the weekend because it's impacting users but I haven't been able to find a +2. Is that code something you have any knowledge of and confidence in reviewing?
[00:01:33] I'm a fan of self-service deploys
[00:01:56] RoanKattouw: I'll leave the patch from Jdlrobson to you :)
[00:01:59] Jdlrobson: Well it's -1ed with a comment saying "does not work", so...
[00:01:59] oh i spoke too soon - bartosz just reviewed lol. guess this one will wait for next week now
[00:02:18] But also, no, HTMLForm is dark magic and I'm not comfortable reviewing changes to it
[00:02:26] mwmaint2001 is being reimaged and already back up but in the process of running puppet - removed it from dsh groups so you should not notice it, but just in case
[00:03:32] (CR) Dave Pifke: [C: +2] profiler: wall-clock excimer instance [mediawiki-config] - https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: Ori.livneh)
[00:05:46] definitely dark magic..
[00:05:51] (Merged) jenkins-bot: profiler: wall-clock excimer instance [mediawiki-config] - https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: Ori.livneh)
[00:06:16] https://gerrit.wikimedia.org/r/c/665192/ is the other patch I have in backport window and that's beta cluster only so should be very straightforward
[00:09:37] Mine looks good on mwdebug2001, rolling out further now.
[00:12:47] !log dpifke@deploy1001 Synchronized wmf-config/profiler.php: Deploying excimer-wall profiler pipeline T253160 (duration: 01m 02s)
[00:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:53] T253160: Wall-clock Excimer profiling in production - https://phabricator.wikimedia.org/T253160
[00:13:59] !log dpifke@deploy1001 Synchronized wmf-config/PhpAutoPrepend.php: Deploying excimer-wall profiler pipeline T253160 (duration: 01m 03s)
[00:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:48] OK, done. Will stick around and keep an eye on logstash for a bit.
[00:15:02] Jdlrobson: All yours. :)
[00:16:14] dpifke: i cant deploy
[00:16:21] i need someone to do that for me :)
[00:16:45] (CR) Urbanecm: [C: +2] Restore logos on Vector (classic version) and use cloud icon for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/665192 (https://phabricator.wikimedia.org/T274210) (owner: Jdlrobson)
[00:16:50] i know i said I'll leave to roan...
[00:16:55] ...but this is just a +2 anyway
[00:17:03] Just a labs change right?
[00:17:18] yeah
[00:17:31] Right, no further action needed then
[00:17:39] RoanKattouw: we should probably sync the static file through
[00:17:51] (Merged) jenkins-bot: Restore logos on Vector (classic version) and use cloud icon for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/665192 (https://phabricator.wikimedia.org/T274210) (owner: Jdlrobson)
[00:17:53] (not used out of labs _right now_, but what if it changes?)
[00:17:59] Jdlrobson: should auto-deploy to beta soon :)
[00:18:12] Yes, good point
[00:18:33] cloud - it will never not be labs (tm)
[00:18:39] neeeattt
[00:18:51] mutante: i never got used to the new name :D
[00:19:36] RoanKattouw: you do the sync, or should i?
[00:19:38] Urbanecm: hehe, yes
[00:19:53] Urbanecm: Go for it
[00:20:04] ok
[00:20:21] Urbanecm: if you have time https://gerrit.wikimedia.org/r/c/mediawiki/core/+/664920 just got +2ed..
[00:20:35] dpifke: git status is really weird in /srv/mediawiki-stagging
[00:20:37] and my understanding is it's been driving page protecters mad all week
[00:20:48] it says `Your branch and 'origin/master' have diverged, and have 1 and 2 different commits each, respectively.`
[00:20:49] Urbanecm: checking.
[00:22:06] Jdlrobson: we want to do both versions, right?
[00:22:20] yeh but only if you feel comfortable
[00:22:40] i am out monday and tuesday next week so i am worried it wont be backported otherwise.
[00:23:33] i can do it; making sure protecting stuff still works should save us from a regression
[00:23:38] (PS1) Jdlrobson: field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/665173 (https://phabricator.wikimedia.org/T275018)
[00:23:51] (PS2) Urbanecm: field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/665173 (https://phabricator.wikimedia.org/T275018) (owner: Jdlrobson)
[00:23:56] (CR) Urbanecm: [C: +2] field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/665173 (https://phabricator.wikimedia.org/T275018) (owner: Jdlrobson)
[00:24:06] Fixed. I think I had a stale Gerrit page (pre-merge) open when I fetched.
[00:24:08] (PS1) Urbanecm: field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/665174 (https://phabricator.wikimedia.org/T275018)
[00:24:13] thanks dpifke
[00:24:18] (CR) Urbanecm: [C: +2] field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/665174 (https://phabricator.wikimedia.org/T275018) (owner: Urbanecm)
[00:24:28] (PS2) Jdlrobson: field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/665174 (https://phabricator.wikimedia.org/T275018) (owner: Urbanecm)
[00:24:45] mutante: I whine about "labs" everywhere but when folks are talking about mediawiki-config. The horrible db name for wikitech is "labswiki" and that's not easy to change :)
[00:24:53] Jdlrobson: oh, is there an issue in the commit i make?
[00:25:02] no my bad
[00:25:17] (CR) Urbanecm: [C: +2] field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/665174 (https://phabricator.wikimedia.org/T275018) (owner: Urbanecm)
[00:26:05] bd808: oh, yea, I see that is maybe the hardest part to change of all. what caught my eye was the combination of "cloud logo for labs" but it is correct then :)
[00:26:34] bd808: we can create wikitechwiki as a regular wiki in the MW cluster, import all of wikitech, and kill its special features, but that's...also not exactly easy
[00:26:53] !log urbanecm@deploy1001 Synchronized static/images/project-logos/wikimedia-cloud-services.svg: 686acba2f31df0d454c6f1c506c042af50b5cce0: Restore logos on Vector (classic version) and use cloud icon for labs (T274210) (duration: 01m 07s)
[00:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:59] T274210: [Regression] some beta cluster wikis using official logos - https://phabricator.wikimedia.org/T274210
[00:27:23] Urbanecm: if you solve T237773 I will supply you with stickers for life :)
[00:27:24] T237773: Move Wikitech onto the production MW cluster - https://phabricator.wikimedia.org/T237773
[00:27:45] hehe
[00:28:11] here is one good part guys.. since we already have wikitech-static there is already something that dumps all of the content and imports it elsewhere
[00:28:25] so that idea doesnt sound bad to me at all actually
[00:30:15] mutante: I'll make you the same offer. Lifetime supply of whatever stickers I have that you want.
[00:31:51] mutante: we'd need to solve T237889 first through
[00:31:51] T237889: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889
[00:32:42] bd808: ah:) I see....
[00:36:11] Urbanecm: right.. that is the part where it's still a special case and not like any other wiki
[00:36:18] yeah
[00:36:54] but as long as mw appservers can talk to ldap, and have the library, it actually isn't _that_ hard
[00:37:44] obviously not easy, but...
[00:38:08] unless it _really_ becomes just a normal wiki that is about the docs content and the whole LDAP / dev user account handling would move to Horizon, I suppose
[00:38:17] none of it is technically challenging work. its just hard to find anyone who will sign off on someone doing it
[00:38:38] T161859 is a later step
[00:38:38] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859
[00:39:05] hmm. maybe the order of things could change.. first move LDAP away from it?
[00:40:02] sure, but then that is blocked by T196171 which would ideally be fixed by T179463
[00:40:03] T196171: Developer account creation without OpenStackManager - https://phabricator.wikimedia.org/T196171
[00:40:03] T179463: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463
[00:40:05] would that be Horizon though are something else yet to evaluated?
[00:40:23] ah
[00:40:58] bd808: am i misremembering it, or does actually striker allow dev account creation?
[00:41:29] Urbanecm: you are not misremembering. https://toolsadmin.wikimedia.org/ can make new accounts
[00:41:42] but there is confusing branding there for Toolforge
[00:42:02] which is the point of T179463 basically. to make that its own thing
[00:42:16] i see
[00:42:28] * bd808 just needs more hours in the day and days in the week
[00:42:45] changing the order of operations might make sense
[00:43:08] because once we have wikitech as a normal non-SUL wiki, making it a SUL wiki is non-trivial
[00:43:20] I'll ship Toolhub first and then see if I can convince folks to let me start on the Developer account portal if it still sitting in the backlog :)
[00:43:29] (we've done it before, all wikis used to be non-SUL, but it's non-trivial)
[00:43:38] Someone will need to fix the account migration/matching scripts... I bet they're at least partially broken :D
[00:43:50] probably :)
[00:43:54] I'm pretty sure legoktm would agree that SUL unification is not trivial :)
[00:44:35] while with moving just the content out of wikitech, it would be a simple xml export and import, no need to even worry about credentials
[00:45:41] "simple"
[00:45:45] * Reedy coughs
[00:45:56] Reedy: XML import is easy. Creating a wiki is not :D
[00:46:07] SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mwmaint2001.codfw.wmnet'] ` and were **ALL** successful.
[00:46:14] probably horrible http://phpldapadmin.sourceforge.net/wiki/index.php/Main_Page
[00:47:01] http://phpldapadmin.sourceforge.net/wiki/index.php/Special:Version
[00:47:22] wow
[00:47:54] it came to mind as LDAP web UI :p
[00:51:23] phpldapadmin is about the same quality of code as phpmyadmin
[00:51:55] heh
[00:52:24] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2899051392 and 189 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:00] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1651405568 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:13] Urbanecm: finnalllyyy :)
[00:53:26] Jdlrobson: oh, is it merged?
[00:54:02] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2682381376 and 165 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:09] the core change merged so ours cant be far behind
[00:54:34] !log mwmaint2001 - back from reimage - scap pull
[00:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:25] got it
[00:56:37] (Merged) jenkins-bot: field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/665173 (https://phabricator.wikimedia.org/T275018) (owner: Jdlrobson)
[00:56:43] (Merged) jenkins-bot: field descriptors in HTMLForm must have keys [core] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/665174 (https://phabricator.wikimedia.org/T275018) (owner: Urbanecm)
[00:56:48] \o/
[00:57:16] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5520 and 120 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:16] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5520 and 120 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:42] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6112 and 145 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:19] Jdlrobson: pulled to mwdebug1001
[00:58:34] testing..
[00:59:05] Urbanecm: LGTM
[00:59:09] thanks
[01:00:33] (PS1) Dzahn: Revert "scap: remove mwmaint2001 from "dsh" groups" [puppet] - https://gerrit.wikimedia.org/r/665175
[01:00:55] bd808: since I have you can you confirm that the logo on https://en.wikibooks.beta.wmflabs.org/wiki/Main_Page meets the labs terms of service? https://usercontent.irccloud-cdn.com/file/PDv6A2Ad/Screen%20Shot%202021-02-18%20at%205.00.50%20PM.png
[01:01:17] Jdlrobson: :shipit: :)
[01:01:26] https://usercontent.irccloud-cdn.com/file/h98zdGj6/classic%20Vector
[01:01:37] yep already shipped and hopefully more future proofed this time
[01:01:52] thanks for working on that
[01:02:05] thanks Urbanecm for helping me backport it before vacation
[01:02:06] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/includes/ProtectionForm.php: 2487c253b090d93daf85adae8ceb9d255cbf4ff2: field descriptors in HTMLForm must have keys (T275018; T274980) (duration: 01m 10s)
[01:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:13] T275018: Prefilling the protection form does not work as expected - https://phabricator.wikimedia.org/T275018
[01:02:13] T274980: Protect Page form leaves the Watch page checkbox unfilled leading to unwatching the page on protect - https://phabricator.wikimedia.org/T274980
[01:03:40] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.30/includes/ProtectionForm.php: d305308a5d46a3f86bf0b211e8a733c0a951ddc1: field descriptors in HTMLForm must have keys (T275018; T274980) (duration: 01m 08s)
[01:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:47] Jdlrobson: should be live
[01:03:50] anything else?
[01:03:51] Urbanecm: on it!
[01:04:01] Urbanecm: nope that would be it (and hugggee thank you)
[01:04:06] i feel much lighter all of a sudden
[01:04:08] happy to help :)
[01:04:12] and enjoy your vacation Jdlrobson
[01:04:19] looking good in production too
[01:04:19] however possible those days :)
[01:04:24] it's been a crazy few weeks :)
[01:08:22] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:12] jouncebot: now
[01:11:12] No deployments scheduled for the next 6 hour(s) and 48 minute(s)
[01:11:19] is everything done?
[01:11:50] Jdlrobson: enjoy the free days now
[01:12:29] Urbanecm: putting mwmaint2001 back into scap, ok?
[01:12:40] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:52] !log deleting my huge build from puppet-compiler that failed because it made the compiler instance run out of disk to run on *
[01:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:34] (CR) Dzahn: [C: +2] Revert "scap: remove mwmaint2001 from "dsh" groups" [puppet] - https://gerrit.wikimedia.org/r/665175 (owner: Dzahn)
[01:22:42] !log mwmaint2001 back on buster and back in scap dsh groups (if anything pops up you can revert 665175)
[01:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:55] mutante: sorry, was afk. Yes, totally ;)
[01:27:46] Urbanecm: no problem, it is done
[01:28:04] Great!
[01:29:43] also mail to ops list now. and be back later. have a good night
[01:35:08] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) Ok, so an-worker11[23]9 needs the network stuff figured out by onsite still, but the installer loop issue i was having is due t...
[02:06:28] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:45] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1133.eqiad.wmnet ` Th...
[02:14:07] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[02:15:44] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[02:18:34] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:34] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1133.eqiad.wmnet with reason: REIMAGE
[02:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1133.eqiad.wmnet with reason: REIMAGE
[02:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:46] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1133.eqiad.wmnet'] ` and were **ALL** successful.
[02:40:01] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:04] SRE, ops-eqiad, Analytics-Clusters, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[03:08:26] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:44] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:00] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:34] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:29] (PS1) KartikMistry: Adjust CX MT threshold to 90 for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665238 (https://phabricator.wikimedia.org/T275121)
[03:34:00] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:30] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:47:30] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:36] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:12] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:44] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:52] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:38] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:42] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:08] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:54] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:19] (PS1) Jforrester: Echo::create: Convert UserIdentityValue to plain User [extensions/Echo] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/665177 (https://phabricator.wikimedia.org/T275161)
[05:42:46] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:52] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:38] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:38] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:12] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:22] SRE, SRE-Access-Requests: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (MoritzMuehlenhoff) p:Triage→Medium
[06:38:37] SRE, SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (MoritzMuehlenhoff) p:Triage→Medium
[06:38:44] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:47] (PS1) Muehlenhoff: Add dancy to gerrit-root [puppet] - https://gerrit.wikimedia.org/r/665245 (https://phabricator.wikimedia.org/T275050)
[06:43:34] (PS1) Muehlenhoff: Remove members of gerrit-admin who are also members of gerrit-root [puppet] - https://gerrit.wikimedia.org/r/665246
[06:43:42] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:43] (CR) Muehlenhoff: [C: +2] Add dancy to gerrit-root [puppet] - https://gerrit.wikimedia.org/r/665245 (https://phabricator.wikimedia.org/T275050) (owner: Muehlenhoff)
[06:46:27] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gerrit-root and gerrit-admin for dancy - https://phabricator.wikimedia.org/T275050 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @dancy : I've added you to gerrit-root, you can log into gerrit1001.wikimedia.org. I di...
[06:47:10] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:39] SRE, Mail, Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603 (Base) (Just in case this is still an ongoing issue, I can provide a fresher copy of headers whenever someone is actively ready to take a g...
[07:12:18] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:22] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:26:42] SRE, serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (MoritzMuehlenhoff) IIRC the previous update for the mwmaint servers happened via a hardware replacement: mwmaint1002 was new server which replaced terbium. Procedure-wise it's probably best if we reimage an ex...
[07:30:58] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[07:34:18] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:39:41] (CR) Muehlenhoff: [C: +1] "I don't think CI is setup here, you can simply manually +V2 it in Gerrit." [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: DCausse)
[07:42:20] PROBLEM - puppet last run on an-worker1108 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:42:56] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:40] there seems to be a soft cpu lockup for --^
[07:47:48] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1108.eqiad.wmnet
[07:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:02] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:08] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:09] SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (MoritzMuehlenhoff)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210219T0800)
[08:01:53] (CR) Filippo Giunchedi: mw_rc_irc: add check_prometheus alert on no messages being relayed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) (owner: Cwhite)
[08:04:36] RECOVERY - puppet last run on an-worker1108 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:04:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1108.eqiad.wmnet
[08:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:44] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[08:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:50] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[08:08:24] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:04] (PS1) Filippo Giunchedi: hieradata: decom ms-be20[16-27] from swift [puppet] - https://gerrit.wikimedia.org/r/665298 (https://phabricator.wikimedia.org/T272837)
[08:13:34] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:26] (PS1) Elukey: cumin: add Hadoop backup aliases [puppet] - https://gerrit.wikimedia.org/r/665299
[08:22:22] (PS2) Elukey: cumin: add Hadoop backup aliases [puppet] - https://gerrit.wikimedia.org/r/665299
[08:24:37] (CR) Elukey: [C: +2] cumin: add Hadoop backup aliases [puppet] - https://gerrit.wikimedia.org/r/665299 (owner: Elukey)
[08:25:38] (CR) Gehel: "Looks good! Thanks for iterating on my OCDs!" [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[08:29:42] (PS1) Elukey: sre.hadoop: add the backup cluster to the stop-cluster cookbook [cookbooks] - https://gerrit.wikimedia.org/r/665300
[08:32:02] (CR) Elukey: [C: +2] sre.hadoop: add the backup cluster to the stop-cluster cookbook [cookbooks] - https://gerrit.wikimedia.org/r/665300 (owner: Elukey)
[08:34:15] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1001.eqiad.wmnet
[08:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:40] (PS1) Elukey: cumin: add hadoop backup journalnode alias [puppet] - https://gerrit.wikimedia.org/r/665301
[08:38:27] (CR) Elukey: [C: +2] cumin: add hadoop backup journalnode alias [puppet] - https://gerrit.wikimedia.org/r/665301 (owner: Elukey)
[08:40:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1001.eqiad.wmnet
[08:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:56] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:24] (PS1) Elukey: Decommission the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/665304 (https://phabricator.wikimedia.org/T274795)
[08:52:53] SRE, SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (Pginer-WMF) >>! In T275138#6841123, @nshahquinn-wmf wrote: > @Pginer-WMF: can you read and sign the [Acknowledgement of Wikimedia Server Access Responsibilities](https://phabricator.wikimed...
[08:54:51] SRE, serviceops, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (hashar) That broke Puppet on `deployment-memc08.deployment-prep.eqiad.wmflabs`, I don't know why other memcached instances are not affected though. I have filed T275187...
[08:55:22] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[09:02:31] (CR) Nik Gkountas: [C: +1] Enable Section Translation on Bengali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: KartikMistry)
[09:04:18] (CR) Ryan Kemper: "Will get this plugin change built/uploaded Friday and will merge this patch then." [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: DCausse)
[09:05:29] (PS3) Ryan Kemper: wdqs: explicit shutdown of Blazegraph during reboots. [cookbooks] - https://gerrit.wikimedia.org/r/662988 (owner: Gehel)
[09:13:43] SRE, Traffic, Documentation, HTTPS, Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (Aklapper)
[09:15:59] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster for Hadoop backup cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001
[09:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:34] PROBLEM - SSH on sretest1001 is CRITICAL: connect to address 10.64.48.138 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:22:17] (PS3) Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - https://gerrit.wikimedia.org/r/665134
[09:23:54] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:16] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop backup cluster: Stop the Hadoop cluster before maintenance.
- elukey@cumin1001 [09:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:14] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:38:02] RECOVERY - SSH on sretest1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:38:06] sretest1001 is me [09:38:28] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:20] (03CR) 10Elukey: [C: 03+2] Decommission the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/665304 (https://phabricator.wikimedia.org/T274795) (owner: 10Elukey) [09:43:38] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:42] (03Abandoned) 10David Caro: utils: add script to run docker ci tests locally [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [10:01:16] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:13:29] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10aborrero) Unfortunately the server still shows the same problems, and even self-rebooted over night. Last log before reboot... [10:18:30] 10SRE, 10GitLab, 10vm-requests, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10mark) >>! In T274459#6841122, @thcipriani wrote: > Whoa, catching up on scrollback overnight. My question is: is this the first anyone in SRE has heard about any of this? I guess... 
[10:19:46] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:56] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:27] 10SRE, 10SRE Program Management, 10Documentation, 10PM: Create a Clinic Duty roster process - https://phabricator.wikimedia.org/T244266 (10Aklapper) [10:27:42] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:28:16] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:37:23] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) [10:38:06] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) Clarifying expected duration and method of depooling for next week. 
[10:38:45] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) [10:45:33] (03PS16) 10Arturo Borrero Gonzalez: toolforge: add haproxy and nginx-ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [10:51:03] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10ayounsi) Note that one of the eqiad/codfw links is still down due to Texas weather issues. I hope it will be back up by the 22nd, but if it's not, we shoul... [10:59:29] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28136/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [11:26:06] (03PS17) 10Arturo Borrero Gonzalez: toolforge: add haproxy and nginx-ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [11:27:38] (03CR) 10jerkins-bot: [V: 04-1] toolforge: add haproxy and nginx-ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez) [11:40:36] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 86 probes of 600 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:43:53] 10SRE, 10serviceops: mcrouter v0.41 fails to connect to a mcrouter v0.37 ssl proxy - https://phabricator.wikimedia.org/T275202 (10jijiki) p:05Triage→03Medium [11:44:00] (03PS18) 10Arturo Borrero Gonzalez: toolforge: add haproxy 
and nginx-ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) [11:45:19] 10SRE, 10serviceops: mcrouter v0.41 fails to connect to a mcrouter v0.37 ssl proxy - https://phabricator.wikimedia.org/T275202 (10jijiki) [11:46:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: add haproxy and nginx-ingress configuration for jobs.toolforge.org [puppet] - 10https://gerrit.wikimedia.org/r/662941 (https://phabricator.wikimedia.org/T274139) (owner: 10Arturo Borrero Gonzalez) [11:46:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 600 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:59:47] 10SRE, 10observability, 10serviceops, 10Sustainability (Incident Followup), 10User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (10jijiki) [12:39:19] (03PS40) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:39:40] (03CR) 10Hnowlan: "I'll be merging this change on Monday" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:55:46] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10MoritzMuehlenhoff) > I propose we try a newer kernel (buster-backports: 5.10.13-1~bpo10+1) to see if that makes any differen... 
[13:15:03] (03PS1) 10Klausman: analytics/camus: Add job to export ATSKafka events to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) [13:20:36] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10fgiunchedi) Reassessing the situation early next week sounds good to me -- we're not terribly in a rush to do this and might as well avoid unnecessary risks [13:21:45] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Xover) So… we're currently waiting for a suitable volunteer to materialize out of thin air to address an iss... [13:21:48] I'm about to reboot prometheus VMs in pops, there will be alerts [13:22:52] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus5001.eqsin.wmnet [13:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] (03PS1) 10Alexandros Kosiaris: echostore: Enable networkpolicy.egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/665322 [13:29:09] !log gehel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE [13:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] echostore: Enable networkpolicy.egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/665322 (owner: 10Alexandros Kosiaris) [13:30:36] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) [13:31:14] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE [13:31:15] 
(03Merged) 10jenkins-bot: echostore: Enable networkpolicy.egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/665322 (owner: 10Alexandros Kosiaris) [13:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:53] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:33:01] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [13:33:38] that's an artifact ^ [13:34:03] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [13:38:24] (03PS2) 10Alexandros Kosiaris: Remove graphoid from services_proxy [puppet] - 10https://gerrit.wikimedia.org/r/663813 (https://phabricator.wikimedia.org/T242855) [13:38:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove graphoid from services_proxy [puppet] - 10https://gerrit.wikimedia.org/r/663813 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [13:39:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5001.eqsin.wmnet [13:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:23] (03PS1) 10Kormat: mariadb: Convert pt-heartbeat to a systemd service. [puppet] - 10https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) [13:41:02] (03CR) 10Kormat: [C: 04-2] "-2: can not be merged until the requisite change is made to wmfmariadbpy and a release is done." [puppet] - 10https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) (owner: 10Kormat) [13:41:40] !log reset-failed ifup@ens13 on prometheus5001 - T273026 [13:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:46] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [13:41:54] (03CR) 10Hashar: "I kind of missed this change this week sorry!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:42:17] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:55] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:42:55] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:42:58] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:42:58] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:43:02] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:13] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:43:13] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:24] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[13:43:24] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:04] (03CR) 10Hashar: profile: add gerrit log duplication and ecs mutations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:48:34] (03PS1) 10Muehlenhoff: Remove two additional aliases which are obsolete with the Hadoop backup cluster removal [puppet] - 10https://gerrit.wikimedia.org/r/665325 [13:50:09] (03CR) 10jerkins-bot: [V: 04-1] Remove two additional aliases which are obsolete with the Hadoop backup cluster removal [puppet] - 10https://gerrit.wikimedia.org/r/665325 (owner: 10Muehlenhoff) [13:57:54] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:58:08] (03PS1) 10Mforns: analytics:refinery:job:data_purge: Absent Growth deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/665326 (https://phabricator.wikimedia.org/T273821) [13:58:52] (03PS2) 10Muehlenhoff: Remove two additional Hadoop aliases [puppet] - 10https://gerrit.wikimedia.org/r/665325 [13:59:58] (03PS1) 10Mforns: analytics:refinery:job:data_purge: Remove Growth deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/665328 (https://phabricator.wikimedia.org/T273821) [14:00:07] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [14:00:08] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'production' . [14:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:51] (03CR) 10Elukey: [C: 03+2] "Thanks! Sorry my bad!" [puppet] - 10https://gerrit.wikimedia.org/r/665325 (owner: 10Muehlenhoff) [14:05:56] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:18:58] (03PS1) 10Muehlenhoff: Add uzoma to wmf LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/665336 (https://phabricator.wikimedia.org/T275139) [14:21:03] (03CR) 10Muehlenhoff: [C: 03+2] Add uzoma to wmf LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/665336 (https://phabricator.wikimedia.org/T275139) (owner: 10Muehlenhoff) [14:22:46] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the nda group for Uzoma Ozurumba - https://phabricator.wikimedia.org/T275139 (10MoritzMuehlenhoff) 05Open→03Resolved a:05elukey→03MoritzMuehlenhoff @UOzurumba: I've added to you the wmf LDAP group, you should be able to access Superset... [14:23:27] (03PS1) 10Kormat: switchover: Use heartbeat systemd service. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665337 (https://phabricator.wikimedia.org/T252528) [14:24:42] (03CR) 10Kormat: [C: 04-2] "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) (owner: 10Kormat) [14:28:59] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@937deb5]: (no justification provided) [14:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@937deb5]: (no justification provided) (duration: 00m 15s) [14:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:35] (03CR) 10CDanis: [C: 03+1] hieradata: decom ms-be20[16-27] from swift [puppet] - 10https://gerrit.wikimedia.org/r/665298 (https://phabricator.wikimedia.org/T272837) (owner: 10Filippo Giunchedi) [14:36:58] (03CR) 10Ottomata: "Do you plan to do the full webrequest load ingestion so that there is a Hive table, or are you just planning to compare using the raw JSON" [puppet] - 10https://gerrit.wikimedia.org/r/665321 
(https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [14:50:19] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: decom ms-be20[16-27] from swift [puppet] - 10https://gerrit.wikimedia.org/r/665298 (https://phabricator.wikimedia.org/T272837) (owner: 10Filippo Giunchedi) [14:55:48] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:26] (03PS1) 10Alexandros Kosiaris: services: Remove monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/665341 [14:57:19] that was ferm failing to reload re: ms-be1040 [14:57:22] (03CR) 10Ottomata: [C: 03+2] Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies. [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/664922 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [14:57:30] DNS query timeout that is [14:57:32] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:12] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.054 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [15:07:36] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:53] (03PS4) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 [15:09:00] (03PS5) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) 
[15:09:37] (03CR) 10jerkins-bot: [V: 04-1] utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [15:12:44] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:47] (03CR) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [15:22:14] (03PS1) 10Elukey: role::analytics_cluster::coordinator: deploy analytics-product users [puppet] - 10https://gerrit.wikimedia.org/r/665352 (https://phabricator.wikimedia.org/T262660) [15:22:46] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10RLazarus) See T266717 for some related discussion. [15:22:51] (03PS1) 10Hnowlan: maps1009: enable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/665353 [15:24:24] (03CR) 10Klausman: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [15:27:16] (03PS1) 10Kormat: integration: Bring mariadb settings closer to prod. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665355 [15:27:42] (03CR) 10Ottomata: [C: 03+1] "k!" [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [15:27:58] (03CR) 10Ottomata: [C: 03+1] "Let me know if you need any help querying the data." 
[puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [15:28:05] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: deploy analytics-product users [puppet] - 10https://gerrit.wikimedia.org/r/665352 (https://phabricator.wikimedia.org/T262660) (owner: 10Elukey) [15:36:45] 10SRE, 10LDAP-Access-Requests: LDAP access to the nda group for Uzoma Ozurumba - https://phabricator.wikimedia.org/T275139 (10Elitre) >>! In T275139#6841740, @Aklapper wrote: >> I am tagging you because I am required to do so. Thank you. > Hmm, that surprises me. Could you elaborate why you think that you are... [15:39:48] (03CR) 10David Caro: "Look ok to me, all the files created under .docker_tmp will be owned by nobody:nobody, so the current user will need to do 'sudo chown -R " (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [15:50:06] (03PS1) 10Elukey: role::analytics_cluster::coordinator: set oozie option to use roles [puppet] - 10https://gerrit.wikimedia.org/r/665358 [15:51:01] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: set oozie option to use roles [puppet] - 10https://gerrit.wikimedia.org/r/665358 (owner: 10Elukey) [15:51:46] (03CR) 10Elukey: [C: 03+2] oozie: Use admin groups for permissions [puppet] - 10https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [15:51:55] (03CR) 10Elukey: oozie: Use admin groups for permissions [puppet] - 10https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [15:55:17] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10sbassett) >>! In T257066#6843508, @Xover wrote: > So… we're currently waiting for a suitable volunteer to ma... 
[15:55:33] (03CR) 10Elukey: "Hey Razzi sorry I completely forgot about this change, I wanted to help Mikhail in re-running a job and didn't check the task, sorry! We'l" [puppet] - 10https://gerrit.wikimedia.org/r/640260 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [15:58:18] jouncebot: now [15:58:18] For the next 16 hour(s) and 1 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210219T0800) [15:58:20] (03CR) 10Kosta Harlan: [C: 03+1] Echo::create: Convert UserIdentityValue to plain User [extensions/Echo] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665177 (https://phabricator.wikimedia.org/T275161) (owner: 10Jforrester) [15:58:59] (03CR) 10Dduvall: [C: 03+2] Echo::create: Convert UserIdentityValue to plain User [extensions/Echo] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665177 (https://phabricator.wikimedia.org/T275161) (owner: 10Jforrester) [15:59:33] ah we need an sre buddy [15:59:34] um [16:00:13] (03CR) 10Thcipriani: [C: 03+1] "Thanks for spotting this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/665246 (owner: 10Muehlenhoff) [16:00:22] asking in -sre [16:01:06] (03PS1) 10Ottomata: Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) [16:02:50] (03CR) 10jerkins-bot: [V: 04-1] Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [16:03:30] (03CR) 10Elukey: [C: 03+1] "Thanks for all the hadoop test love :)" [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [16:04:19] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace sata cables for cloudvirt1024 - https://phabricator.wikimedia.org/T275215 (10dcaro) [16:04:29] (03PS2) 10Ottomata: Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) [16:04:46] (03CR) 10Ottomata: "In progress already :)" [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [16:04:49] hmm "help test" in this case, well I don't have restore on any wikis so not sure if I can test or rather just watch logstash [16:06:37] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace sata cables for cloudvirt1024 - https://phabricator.wikimedia.org/T275215 (10dcaro) Just to clarify, I'm not sure how the raid controller is setup, or which cable is failing so we will need to replace all of them (unless someone can pinpoint if/w... [16:06:39] (03CR) 10Kormat: [C: 03+2] integration: Bring mariadb settings closer to prod.
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665355 (owner: 10Kormat) [16:06:44] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Xover) >>! In T257066#6843760, @sbassett wrote: > There is some progress being made on various protected tas... [16:07:35] (03PS3) 10Ottomata: Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) [16:09:57] (03Merged) 10jenkins-bot: integration: Bring mariadb settings closer to prod. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665355 (owner: 10Kormat) [16:10:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:21] (03PS4) 10Ottomata: Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) [16:11:36] (03PS5) 10Ottomata: Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) [16:12:16] 10SRE, 10LDAP-Access-Requests: LDAP access to the nda group for Uzoma Ozurumba - https://phabricator.wikimedia.org/T275139 (10MoritzMuehlenhoff) >>! In T275139#6843719, @Elitre wrote: > Doesn't the manager need at least a headsup? I thought my approval was even necessary :) We only need manager approval for... 
[16:17:29] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/28142/an-test-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [16:17:31] (03CR) 10Ottomata: [C: 03+2] Update test cluster refine jobs to use event platform schemas [puppet] - 10https://gerrit.wikimedia.org/r/665359 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [16:17:38] (03CR) 10MSantos: [C: 03+1] maps1009: enable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/665353 (owner: 10Hnowlan) [16:19:08] (03PS1) 10Elukey: bigtop: require hadoop users before installing daemon packages [puppet] - 10https://gerrit.wikimedia.org/r/665360 (https://phabricator.wikimedia.org/T231067) [16:20:31] apergos: no takers, huh? [16:21:29] not yet, trying one more place [16:26:34] (03CR) 10Hnowlan: [C: 03+2] maps1009: enable tilerator [puppet] - 10https://gerrit.wikimedia.org/r/665353 (owner: 10Hnowlan) [16:28:20] (03PS2) 10Elukey: bigtop: require hadoop users before installing daemon packages [puppet] - 10https://gerrit.wikimedia.org/r/665360 (https://phabricator.wikimedia.org/T231067) [16:29:25] anything I can do to help as a lowly deployer? 
[16:30:08] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 70 probes of 597 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:30:39] (03PS2) 10Dzahn: mcrouter: move mcrouter proxy for B6 from mw1287 to mw1288 [puppet] - 10https://gerrit.wikimedia.org/r/664898 (https://phabricator.wikimedia.org/T245757) [16:32:09] (03Merged) 10jenkins-bot: Echo::create: Convert UserIdentityValue to plain User [extensions/Echo] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665177 (https://phabricator.wikimedia.org/T275161) (owner: 10Jforrester) [16:34:44] (03CR) 10Dzahn: [C: 03+2] mcrouter: move mcrouter proxy for B6 from mw1287 to mw1288 [puppet] - 10https://gerrit.wikimedia.org/r/664898 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [16:35:26] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 62 probes of 596 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:35:38] (03PS2) 10Dzahn: mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321 [puppet] - 10https://gerrit.wikimedia.org/r/664691 (https://phabricator.wikimedia.org/T245757) [16:38:35] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1134.eqiad.wmnet',... [16:39:31] 10SRE, 10SRE-Access-Requests: Requesting access to gerrit-root and gerrit-admin for dancy - https://phabricator.wikimedia.org/T275050 (10dancy) Thanks @MoritzMuehlenhoff! I verified that I can log in and sudo as needed. 
[16:39:54] (03CR) 10Dzahn: [C: 03+2] mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321 [puppet] - 10https://gerrit.wikimedia.org/r/664691 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [16:41:38] (03PS1) 10Ottomata: Spark 2.4.4 with Hadoop jars removed [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/665362 (https://phabricator.wikimedia.org/T274384) [16:42:48] (03PS2) 10Ottomata: Spark 2.4.4 with Hadoop jars removed [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/665362 (https://phabricator.wikimedia.org/T274384) [16:43:50] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 70 probes of 596 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:46:12] (03PS2) 10Cwhite: profile: add gerrit log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) [16:46:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 17 DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28146/console" [puppet] - 10https://gerrit.wikimedia.org/r/665360 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [16:46:23] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ladsgroup) We are building a new dedicated service with special security considerations so we can make this... 
[16:46:44] (03CR) 10Cwhite: profile: add gerrit log duplication and ecs mutations (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:49:30] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 47 probes of 596 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:51:25] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1134.eqiad.wmnet with reason: REIMAGE [16:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:06] (03PS1) 10David Caro: gitignore: added vim swapfiles [software/cumin] - 10https://gerrit.wikimedia.org/r/665364 [16:52:08] (03PS1) 10David Caro: tox: added py39 support [software/cumin] - 10https://gerrit.wikimedia.org/r/665365 [16:52:10] (03PS1) 10David Caro: transport.clustershell: handle str when reporting commands [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) [16:53:25] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1140.eqiad.wmnet with reason: REIMAGE [16:53:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1134.eqiad.wmnet with reason: REIMAGE [16:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:30] (03CR) 10David Caro: transport.clustershell: handle str when reporting commands (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [16:55:27] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1141.eqiad.wmnet with reason: REIMAGE [16:55:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1140.eqiad.wmnet with reason: REIMAGE [16:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:36] (03CR) 10David Caro: transport.clustershell: handle str when reporting commands (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [16:57:37] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1141.eqiad.wmnet with reason: REIMAGE [16:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:43] (03CR) 10Cwhite: mw_rc_irc: add check_prometheus alert on no messages being relayed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [16:58:54] (03PS2) 10David Caro: transport.clustershell: handle str when reporting commands [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) [17:03:06] (03CR) 10Elukey: bigtop: require hadoop users before installing daemon packages [puppet] - 10https://gerrit.wikimedia.org/r/665360 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [17:04:11] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqia... 
[17:05:09] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [17:06:23] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [17:06:38] (03CR) 10BryanDavis: [C: 03+1] "Ripping the tls on/off option out of the dynamicproxy module could come after this too I think. TLS all the things! :)" [puppet] - 10https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123) (owner: 10Arturo Borrero Gonzalez) [17:14:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:17:59] 10SRE, 10SRE-Access-Requests: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10Dzahn) [17:22:16] 10SRE, 10SRE-Access-Requests: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10Dzahn) [17:23:15] 10SRE, 10SRE-Access-Requests: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10Dzahn) confirmed L3 signature, confirmed employee status, confirmed has existing entry in admin.yaml already, removed SSH key checkbox (not needed) needs: sign-off, patch (uploading) [17:25:22] 10SRE, 10GitLab, 10vm-requests, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani) [17:26:14] (03PS1) 10Dzahn: admin: add amuigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) [17:26:41] (03CR) 10jerkins-bot: [V: 04-1] admin: add amuigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [17:27:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1367.eqiad.wmnet with reason: REIMAGE [17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:40] (03PS2) 10Dzahn: admin: add amuigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) [17:27:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10Dzahn) [17:28:06] (03CR) 10jerkins-bot: [V: 04-1] admin: add amuigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [17:28:26] (03PS3) 10Dzahn: admin: add Angie Muigai to analytics-privatedata-users [puppet] - 
10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) [17:28:51] (03CR) 10jerkins-bot: [V: 04-1] admin: add Angie Muigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [17:28:57] ah Lucas_WMDE sorry your message got missed, I don't think there's much to do but wait until an sre is available for this emergency deploy [17:29:08] ok, no problem [17:29:19] good luck! [17:29:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1367.eqiad.wmnet with reason: REIMAGE [17:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:34] (03CR) 10Dzahn: "apparently the CI checks are not expecting this situation yet where we have LDAP-only admins without shell access but in the analytics-pri" [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [17:31:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2272.codfw.wmnet with reason: REIMAGE [17:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:03] thanks! [17:32:53] 10SRE, 10GitLab, 10vm-requests, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani) Updated task description for the network to be merely accessible over http vs external IP after discussing with @mark The original thinking was that this is a Gerrit r... 
[17:33:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2272.codfw.wmnet with reason: REIMAGE [17:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1341.eqiad.wmnet with reason: REIMAGE [17:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1341.eqiad.wmnet with reason: REIMAGE [17:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:21] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) [17:39:57] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) confirmed Matthias has signed L3 back in 2017. confirmed employee status (is listed as contractor though in corp LDAP, this would require an expiry_date, but also likely this informati... [17:40:29] mutante: any chance you could be my deploy buddy for re-rolling wmf.31 at some point in the next few hours? :) [17:41:38] 10SRE, 10serviceops: mcrouter v0.41 fails to connect to a mcrouter v0.37 ssl proxy - https://phabricator.wikimedia.org/T275202 (10jijiki) 05Open→03Resolved a:03jijiki We may have discovered this sooner if we had a relevant alert. 
I am closing this task and cont the alert/monitoring discussion on T253384 [17:53:05] 10SRE, 10SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (10Dzahn) [17:53:56] 10SRE, 10SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (10Dzahn) confirmed L3 signature, confirmed employee status (is listed as contractor in corp LDAP though, this would require an expiry_date but it's also likely that information is outdated) [17:53:58] (03CR) 10Ottomata: "Will install this in main cluster on monday" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/665362 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [17:54:02] (03CR) 10Ottomata: [C: 03+2] Spark 2.4.4 with Hadoop jars removed [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/665362 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [17:56:31] be back in a few minutes [17:57:28] (03CR) 10Muehlenhoff: "You need to move the user to the shell access table, but without an SSH key, there are a few existing users commented with # Added with no" [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [17:59:08] marxarelli: i dont know anything about it but I can be here in about 45 min [17:59:16] guessing it's UBN [17:59:24] or we wouldnt do it on Friday [17:59:30] it's a UBN, yeah [17:59:36] thanks [17:59:56] will be back soon. just picking up food [18:00:02] no problem [18:01:01] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2272.codfw.wmnet'] ` an... 
[18:04:43] someone poke me when the gang's all here :-) [18:07:24] !log Password reset for User:Kolyma (T274737) [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:32] T274737: Account recovery for Kolyma - https://phabricator.wikimedia.org/T274737 [18:07:51] marxarelli: apergos: if you need a help with UBN deployment, happy to help [18:08:19] we are flush on deployers now! there's you, plus Lucas plus us :-) [18:08:55] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:09:09] uh. grrrrrrrr [18:09:12] what? [18:09:29] NOW we get a spike? [18:10:01] did we do something? [18:10:03] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) This is an example of an ongoing backup (testwiki): ` db2102.codfw.wmnet[mediabackups]> select backup_status_n... [18:10:03] I'm not seeing the spike in logstash though [18:10:11] not to my knowledge no [18:11:36] i see a lot of warnings in cache-cookies channel [18:12:14] 10SRE, 10observability, 10serviceops, 10Sustainability (Incident Followup), 10User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (10jijiki) I thought about this a bit, I do have some reservations as it is towards the right direction, but here it is. A... 
[18:12:46] 21k in over 15 minutes apergos [18:13:02] it's lots of timeouts [18:13:14] which the dashboard filters out, I'm on mwlog1001 looking [18:13:31] this looks...suspicious https://usercontent.irccloud-cdn.com/file/lTYQ51up/image.png [18:13:49] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:13:56] there are lots of cookie warnings, but that doesn't correlate with the latest spike [18:13:58] those are old [18:14:36] what even is using mwdebug at that rate? no mediawiki changes have even been deployed in some time now [18:15:11] nice [18:15:30] not sure [18:15:39] does anyone know whether mwdebug events are included in the alert? [18:16:24] most likely, but sadly the way to verify (mwdebug dashboard doesn't work well) [18:16:59] for the timeouts the top wiki is ja wikivoyage by far but I don't see those coming via mwdebugxxx [18:17:12] they are all wtp something, and parsoid timing out [18:17:23] there was a spike of DEBUG on mwdebug, but no more, I see on other dashboard [18:18:37] urls all look different [18:18:58] /w/rest.php/ja.wikivoyage.org/v3/page/pagebundle/%E6%84%9B%E5%AA%9B%E7%9C%8C/15898 but different values for the number and pagebundle name [18:19:07] apergos: can you post a link to logstash dashboard showing this? [18:19:40] https://logstash.wikimedia.org/goto/45b5133c481a8765cad4472de518c8d6 I didn't filter for them but they are the main error [18:20:06] without the timeouts it's almost nothing right now, so that's gotta be the spike [18:20:46] easy to verify by uninstalling parsoid (joking, right) [18:22:06] I can confirm independently it is WMFtimeout exception [18:22:38] pagebundle: A JSON blob containing the above html with the data-parsoid attributes split out and ids added to each node. 
Content type is application/json [18:22:56] the "above html" is "Parsoid's XHTML5 + RDFa output, which includes inlined data-parsoid attributes." [18:23:02] copy pasting from some old mediawiki docs [18:23:53] increase started around 17:55-17:57 [18:24:06] maybe traffic-driven, not deployment? [18:24:30] someone's busy over there in rc but not nearly that busy [18:24:45] I mean it can't be deployment because nothing was deployed [18:25:03] yeah, I double checked- in case there was a stealth deploy or something [18:25:25] so there is no actionable for deployment [18:25:40] nope [18:25:58] except get rid of this somehow *so* an emergency deployment can happen later >_< [18:26:20] anyways: no user agent associated with this, don't know what/who is requesting them, don't know why .... [18:27:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:28:07] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 68 probes of 596 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:28:20] even more alerts :( [18:29:02] tried a sample reqId, there are no other entries accompanying it [18:29:28] (03CR) 10Razzi: [C: 03+2] analytics:refinery:job:data_purge: Remove Growth deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/665328 (https://phabricator.wikimedia.org/T273821) (owner: 10Mforns) [18:29:29] I'm here and would have said keep the deploy coming, but I see we already have issues. So what I will be doing now is make sure the _ongoing_ reimages are finished and put servers back in pool .. 
but not start anything new [18:29:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2272.codfw.wmnet [18:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:11] !log mw1367 - powercycled - stuck in reboot [18:30:12] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1341.eqiad.wmnet'] ` an... [18:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1341.eqiad.wmnet [18:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:42] removed all filters, same: only the one entry. no way to tie it to something else [18:30:52] we need someone with more clue about parsoid/restbase [18:31:27] (03CR) 10Razzi: [C: 03+2] "Thanks for cleaning this kind of technical debt up!" [puppet] - 10https://gerrit.wikimedia.org/r/665326 (https://phabricator.wikimedia.org/T273821) (owner: 10Mforns) [18:32:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2272.codfw.wmnet [18:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:37] subbu: are you around for a parsoid issue? [18:33:50] sure. [18:34:27] subbu: it was based on 18:30 < apergos> we need someone with more clue about parsoid/restbase [18:34:31] subbu: we are getting a lot of timeouts on jawikivoyage, enough for alerts to fire [18:34:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1367.eqiad.wmnet'] ` an... 
[18:35:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1367.eqiad.wmnet [18:35:10] a sample is at https://logstash.wikimedia.org/goto/fbb1d59201fb45baa2b0918eabd078ae for one entry, subbu [18:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:26] ok .. looking. [18:35:51] this started about 28 minutes ago (?), or at least the alarm fired then [18:35:55] thanks a lot [18:35:57] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:36:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1341.eqiad.wmnet [18:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:31] https://ja.wikivoyage.org/wiki/%E9%95%B7%E5%B4%8E%E7%9C%8C seems vandalized? [18:36:37] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Jclark-ctr) memory swap completed @Marostegui [18:36:46] would that cause such an issue? [18:37:15] it is possibly triggering some corner case in Parsoid's tokenizing. [18:37:27] https://logstash.wikimedia.org/goto/45b5133c481a8765cad4472de518c8d6 here's a link to the full set (plus some other errors but they are few in comparison) [18:37:31] huh [18:37:51] we've fixed them over the years, but we keep discovering newer ones once in a while. 
[18:37:54] https://ja.wikivoyage.org/wiki/%E7%89%B9%E5%88%A5:%E6%9C%80%E8%BF%91%E3%81%AE%E6%9B%B4%E6%96%B0?hidebots=1&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2&uselang=en I looked here but there's only a few things a minute happening [18:37:56] i blocked User:Die - sister project - die [18:38:35] it is a crap user name, blockable just for that i guess [18:38:39] yeah [18:38:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1367.eqiad.wmnet [18:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:01] i don't see anything else in RC [18:39:08] does each failure trigger a bunch of retries or something? [18:39:21] yeah I didn't see anything either, and the numbers didn't add up so I wrote it off [18:39:40] https://ja.wikivoyage.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85%E3%83%BB%E3%83%88%E3%83%BC%E3%82%AF%3A-revi same thing [18:39:56] RESTBase has some retry count yes [18:40:13] {{template:ww}}{{template:ww}}{{template:ww}}{ bunch of cruft like this in there [18:40:15] so, till it gives up, we'll see those repeated failures [18:40:22] ah the autoretry is probably what gets us then [18:40:31] how unfortunate [18:40:53] sadly that was a lot of vandalism, needs revert: https://ja.wikivoyage.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E6%8A%95%E7%A8%BF%E8%A8%98%E9%8C%B2/Die_-_sister_project_-_die&offset=&limit=500&target=Die+-+sister+project+-+die [18:40:59] filled with {{template:破壊なう}} and other garbage. [18:41:02] people seem to be on it [18:41:14] jynus: I can revert it, but i don't want to cause any issues [18:41:23] yeah [18:41:26] oh no I am wrong [18:41:28] if rollbacking it is fine, i can do it [18:41:30] misreading... 
[18:41:37] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 46 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:41:41] ohoho [18:41:45] probably it caused lots of parsing activity [18:42:19] well that was exciting in a bad way [18:42:45] now I know that every restbase failure or parsoid one, whichever, is magnified, I'll make a mental note [18:42:57] is that editor blocked now? [18:43:01] subbu: yes [18:43:07] oh i see you did. [18:43:09] thankx [18:43:10] if you mean https://ja.wikivoyage.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E6%8A%95%E7%A8%BF%E8%A8%98%E9%8C%B2/Die_-_sister_project_-_die [18:43:20] my inclination would be to (yes they are) check with someone on the small wikis antivandalism group or the stewards [18:43:25] but ymmv, Urbanecm [18:43:51] what is the concern with rolling back those vandalized pages? [18:44:04] no concern with having them rolled back, just who should do it [18:44:09] ah, ok. [18:44:09] if vandalizing the pages can cause an outage, maybe reverting it can too? [18:44:25] the new version of the page is rendered [18:44:28] without the vandalism [18:44:32] should be fine [18:44:34] right. [18:44:36] that, if we should do it slowly or doesn't matter? [18:44:43] (famous last words :-P) [18:44:44] in that case, it's just a single button on my side to revert it all :) [18:44:48] nah, jfdi [18:45:14] but leave a message if you find a place to leave one, in case there are local admins, after. [18:45:22] less than a 100 pages? should be fine afaict. 
[18:45:32] then, proceed [18:46:21] apergos: sure, I'm a steward, so handling small wiki issues is normal for me :) [18:46:30] oh well then, please consult yourself :_D [18:46:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:46:37] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) an-worker1139 corrected dac cable for host moved to port 25 [18:47:03] oh, hey, I already reverted all of that separately :) [18:47:07] thanks :) [18:47:08] lol [18:47:10] but thanks Urbanecm for nuking [18:47:11] well done! [18:48:41] thanks everybody, and sharp-eyed subbu in particular [18:49:04] huh, so .... [18:49:06] *cough* [18:49:15] how about that train eh :-P :-D [18:49:18] picking up reimages again [18:49:24] and deploy it :o [18:49:27] we already had a recovery [18:49:34] so...all right now it seems? [18:49:39] marxarelli: heat up the water for the train engine [18:49:44] thx for the heads up. i'll go back into manager and paperwork land again. :) [18:49:50] subbu: thank you [18:49:51] yeah looks fine [18:50:13] :) aye aye [18:50:19] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) an-worker1129 verified host is plugged into xe-4/0/3 [18:50:45] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:52:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:52:45] !log fetching backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/665177 for sync prior to all wikis (re)deploy (T275161) [18:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:52] T275161: InvalidArgumentException: Invalid user parameter in EchoEvent::create - https://phabricator.wikimedia.org/T275161 [18:53:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:55:27] apergos: no of a way to test this Echo patch on mwdebug or should i just sync all? [18:55:30] *know* [18:55:44] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:56:09] well it was special:undelete but I don't know of which things, we [18:56:09] sync-file to all servers that is, not sync world [18:56:18] really shouldn't monkey around on frwiki or ptwiki I guess [18:56:44] so this needs to go with 31 [18:56:49] so it can go where 31 is already [18:57:09] that's groups 0 and 1 right? [18:57:09] yeah, sounds good. i'll just sync-file everywhere then [18:57:30] right [18:57:36] i'm not advancing wikiversions yet. 
just getting the Echo fix out [18:57:53] right right [18:58:55] 10SRE, 10Traffic: Wikipedia not opening images in any browser except Opera. - https://phabricator.wikimedia.org/T275211 (10CDanis) [18:59:05] apergos: I can see special:undelete everywhere, if we need some testing [18:59:35] it would have been nice to have a url that we know was broken on e.g. testwiki [18:59:47] and then see it not be broken with the fix [19:00:00] but we don't and we won't.... [19:00:27] :( [19:00:30] yeah well [19:01:14] !log dduvall@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/Echo/includes/model/Event.php: backport: [[gerrit:665177|Echo::create: Convert UserIdentityValue to plain User (T275161)]] (duration: 01m 20s) [19:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:20] T275161: InvalidArgumentException: Invalid user parameter in EchoEvent::create - https://phabricator.wikimedia.org/T275161 [19:01:28] I don't even know what event is trying to be created there [19:02:10] a deferred LinkUpdate iirc [19:02:58] okey dokey. well, backport is deployed. time to roll [19:03:16] uh [19:03:34] this is a notification to a user isn't it? echo I mean. so maybe something is undeleted that is on a watchlist, or I dunno [19:03:38] wildly guessing [19:03:42] anyways [19:03:52] we'll just watch logstash .... [19:04:08] apergos: users can get notifications when their article is linked in another article [19:04:23] if that's the thing marxarelli mentioned, it might be triggered on undeletion as well?
[19:04:36] could be [19:07:33] * apergos waits for the roll to roll [19:07:46] (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665381 [19:07:48] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665381 (owner: 10Dduvall) [19:08:44] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665381 (owner: 10Dduvall) [19:09:16] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.738 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:09:18] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1129.eqiad.wmnet ` The log can be found in... [19:11:06] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.31 [19:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:14] bam! [19:12:24] wham-o! 
[19:13:06] :-D [19:14:40] (03PS1) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 [19:14:43] watching the logstash grass grow [19:15:16] (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo) [19:15:17] no weeds yet :) [19:16:10] or worse, daisies being pushed up <- sure sign of imminent zombie attack [19:17:46] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) [19:17:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1261.eqiad.wmnet with reason: REIMAGE [19:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:12] lol [19:18:47] yeah and I'm also waiting to see an undelete on en wp [19:19:14] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1139.eqiad.wmnet ` The log can be found in...
[19:19:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1270.eqiad.wmnet with reason: REIMAGE [19:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1261.eqiad.wmnet with reason: REIMAGE [19:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1287.eqiad.wmnet with reason: REIMAGE [19:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1270.eqiad.wmnet with reason: REIMAGE [19:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:13] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1129.eqiad.wmnet with reason: REIMAGE [19:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1287.eqiad.wmnet with reason: REIMAGE [19:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2257.codfw.wmnet with reason: REIMAGE [19:26:04] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1129.eqiad.wmnet with reason: REIMAGE [19:28:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2257.codfw.wmnet with reason: REIMAGE [19:31:57] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 68 probes of 681 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:32:03] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1139.eqiad.wmnet with reason: REIMAGE [19:32:32] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet'] ` and were **ALL** successful. [19:32:48] !gb 88.246.198.97 [19:32:51] sorry, wrong channel [19:33:48] apergos: haven't seen anything yet. i'm thinking of calling it a (tentative as is always the case) success [19:34:06] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1139.eqiad.wmnet with reason: REIMAGE [19:34:07] thanks for the support! [19:34:08] the only thing i do see for the past half hour is two labswiki errors (it was already on .31) [19:34:11] (03PS2) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 [19:34:37] which are however also echo related, might be good for $someone who works on that extension to check it out but not urgent [19:34:50] (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo) [19:34:52] (03PS4) 10Dzahn: admin: add Angie Muigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) [19:35:11] https://logstash.wikimedia.org/goto/953281aa84a0b267655444a212c0202e for the record [19:35:16] the "Field 'notification_bundle_display_hash' doesn't have a default" error? 
[19:35:22] yeah that thing [19:35:27] i'm pretty sure that's a known issue [19:35:45] (03CR) 10jerkins-bot: [V: 04-1] admin: add Angie Muigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [19:35:46] something to do with buster's default mariadb config iirc [19:36:11] https://phabricator.wikimedia.org/T262033 well it was [19:36:14] but uh [19:36:26] I mean unless there is not a new task for it [19:36:30] https://phabricator.wikimedia.org/T262033 [19:36:46] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 596 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:36:46] ah, you beat me :) [19:36:46] (03PS3) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 [19:36:48] i can mention that we're still seeing it [19:36:55] re-open and mention labswiki? yeah [19:37:03] it's probably specific to that or something [19:40:28] otherwise, it's been lovely watching logstash with you, may our weekend be blissfully quiet! [19:40:52] i hope so! thanks again [19:41:05] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) I updated the task description with the current plan. Progress is being made regularly on various s... [19:41:34] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1139.eqiad.wmnet'] ` and were **ALL** successful. [19:41:36] * marxarelli eyes, with great contempt, the pile driver at the marina a block away [19:41:56] oh. "quiet".
eh [19:42:23] maybe they won't work on the weekend... [19:44:22] no, fortunately they're not allowed to due to city ordinances. getting it all done now haha [19:44:58] (03PS5) 10Dzahn: admin: add Angie Muigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) [19:45:21] makes for... "rhythmic" train deploys [19:46:29] ouch!! [19:47:39] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 10 probes of 681 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:48:52] !log 1.36.0-wmf.31 re-rolled to all wikis (T271345) [19:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:58] T271345: 1.36.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T271345 [19:49:26] (03PS6) 10Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) [19:50:17] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) [19:50:40] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10RobH) [19:50:42] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10RobH) [19:51:19] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) 05Open→03Resolved a:05Jclark-ctr→03RobH All hosts installed and staged in netbox. [19:52:25] there sure are a lot of old mw versions lying around. 
i'm going to `scap clean` some of those before i jet [19:52:39] yay cleanup [19:57:20] marxarelli: should the No atomic section is open (got LocalFile::lockingTransaction) one be a train blocker for next week? it appears it was around in at least wmf.27 and maybe not related to any deploy, see https://phabricator.wikimedia.org/T274589#6832113 [19:57:26] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.25 (duration: 04m 09s) [19:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:32] apergos: i'm not sure. i think it's been ferried along to keep deployers' eyes on it, but no it's not really a blocker [19:59:16] can we leave it in production-errors and take if off as a blocker maybe? [19:59:25] that sounds reasonable to me [19:59:29] i'll do that [19:59:36] thanks a lot [20:01:28] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.26 (duration: 02m 12s) [20:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:19] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.27 (duration: 02m 12s) [20:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:19] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.28 (duration: 01m 50s) [20:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1261.eqiad.wmnet'] ` an... [20:12:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2257.codfw.wmnet'] ` an... 
[20:13:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1270.eqiad.wmnet'] ` an... [20:14:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1287.eqiad.wmnet'] ` an... [20:15:30] !log dduvall@deploy1001 Pruned MediaWiki: 1.36.0-wmf.29 (duration: 01m 42s) [20:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [20:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:18] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1270.eqiad.wmnet [20:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2257.codfw.wmnet [20:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:25] 10ops-eqiad: Eqiad: Port with no description xe-4/0/24 xe-4/0/3 - https://phabricator.wikimedia.org/T275241 (10Papaul) [20:24:18] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace sata cables for cloudvirt1024 - https://phabricator.wikimedia.org/T275215 (10wiki_willy) a:03Jclark-ctr [20:24:51] 10ops-eqiad: Eqiad: Port with no description xe-4/0/24 xe-4/0/3 - https://phabricator.wikimedia.org/T275241 (10wiki_willy) a:03Cmjohnson [20:26:02] * marxarelli is done pruning old mw versions now [20:27:11] Got a bunch of ` .31 i/c/l/LocalisationCache:512 No localisation cache found for English. 
Please run maintenance/rebuildLocalisationCache.php.` [20:27:17] Already known marxarelli? [20:28:35] I will investigate [20:30:00] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:30:39] that's odd [20:30:45] recently repooled server maybe? [20:30:46] re-imaged machines? [20:30:49] yeah. [20:31:01] mutante: ^ [20:31:34] I've heard that `scap sync-wikiversions` will fix up a re-imaged machine [20:31:47] do you have machine name? [20:31:50] mw1287 is one. [20:31:54] one sec [20:31:57] mw1261 [20:32:02] mw1270. [20:32:03] mw1261 and mw1270 [20:32:11] yeah and mw1287 [20:32:30] !log mw1287 - scap pull [20:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:13] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin 'mw1261*,mw1270*,mw1287*' 'depool' [20:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:17] !log mw1261, mw1270 - scap pull [20:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:22] i wonder if they were out during the initial scap sync [20:33:24] eh.. [20:33:34] mutante: ah didn't realize you were already on it -- repool at your leisure [20:33:35] now we won't know though [20:33:42] cdanis: ok, ACK! 
[20:34:21] Started scap-cdb-rebuild [20:34:30] i think that's it [20:34:46] it's building the CDP files [20:34:49] CDB [20:35:36] ack [20:37:10] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 29 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:37:34] ACK, that is done and should be fixed [20:37:37] my bad [20:37:37] whew [20:38:43] i was worried i broke something with the `scap clean` [20:39:03] like... cleaned wmf.31 by mistake :) [20:40:11] I think it doesnt always need the CDB rebuild but sometimes it does [20:40:20] nice. looks ok. thanks for the quick recovery mutante. and thanks for the catch dancy [20:40:37] yeah, curious why that is [20:41:04] i missed a step in the workflow though for these, sry [20:41:54] too many split terminals [20:42:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1287.eqiad.wmnet [20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet [20:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1270.eqiad.wmnet [20:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:02] let me make sure the next ones are fine and repool them as well [20:43:45] (03PS1) 10RobH: updating skus for r740xd2 [software] - 10https://gerrit.wikimedia.org/r/665414 [20:45:32] yea, so for example on mw1261 it took over a full minute to "cdb-rebuild" [20:45:44] now if i run that again it is not even a second [20:46:18] (03CR) 10RobH: [C: 03+2] updating skus for r740xd2 [software] - 10https://gerrit.wikimedia.org/r/665414 (owner: 10RobH) [20:46:40] PROBLEM - SSH on analytics1058.mgmt is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:47:32] mw2257 is scap pulling for over 4 minutes.. mw1270 is done right away [20:48:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet [20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1270.eqiad.wmnet [20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2257.codfw.cwmnet [20:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2257.codfw.wmnet [20:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:02] everything seems still ok, right? [20:54:07] reimaging the last canary in eqiad [20:55:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:56:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:57:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1287.eqiad.wmnet [20:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:46] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1287.eqiad.wmnet [20:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:46] yeah still ok [21:01:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:03:19] mutante: Yes, everything looks normal. [21:04:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:04:33] dancy: apergos: great, thanks [21:05:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [21:07:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [21:08:45] yea, cant confirm that. list info page works for me [21:08:57] rescheduled those checks [21:09:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 87697 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [21:09:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15527 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [21:09:25] the HTTP check was already back.. 
and yea [21:12:08] (03CR) 10Hashar: profile: add gerrit log duplication and ecs mutations (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:15:23] (03CR) 10Hashar: [C: 03+1] "Thank you so much for the detailed explanation! I have learned a few things and indeed having all events to have the syslog fields sounds " [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:19:21] (03PS1) 10Ahmon Dancy: Supply default value for profile::memcached::enable_16 for cloud [puppet] - 10https://gerrit.wikimedia.org/r/665417 (https://phabricator.wikimedia.org/T270315) [21:20:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2336.codfw.wmnet with reason: REIMAGE [21:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1262.eqiad.wmnet with reason: REIMAGE [21:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2336.codfw.wmnet with reason: REIMAGE [21:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1320.eqiad.wmnet with reason: REIMAGE [21:24:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1262.eqiad.wmnet with reason: REIMAGE [21:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1320.eqiad.wmnet with reason: REIMAGE [21:27:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:44] (03CR) 10Hashar: [C: 03+1] "We have:" [puppet] - 10https://gerrit.wikimedia.org/r/665417 (https://phabricator.wikimedia.org/T270315) (owner: 10Ahmon Dancy) [21:29:05] (03CR) 10Effie Mouzeli: [C: 03+1] "oops. I will merge on monday, sorry!" [puppet] - 10https://gerrit.wikimedia.org/r/665417 (https://phabricator.wikimedia.org/T270315) (owner: 10Ahmon Dancy) [21:29:46] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10hashar) [21:32:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1340.eqiad.wmnet with reason: REIMAGE [21:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1340.eqiad.wmnet with reason: REIMAGE [21:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2336.codfw.wmnet'] ` an... [21:44:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2336.codfw.wmnet [21:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:16] 10SRE, 10observability, 10serviceops, 10Sustainability (Incident Followup), 10User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (10jijiki) After some fiddling, another thing I am wondering if possible things to alert on the sum of hard TKO states re... 
[21:50:38] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2336.codfw.wmnet [21:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:00:26] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.21 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [22:07:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1262.eqiad.wmnet'] ` an... [22:08:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [22:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet [22:09:42] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1320.eqiad.wmnet'] ` an... 
[22:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:01] (03CR) 10Cwhite: profile: add gerrit log duplication and ecs mutations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:10:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1261.eqiad.wmnet [22:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1262.eqiad.wmnet [22:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:14:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1320.eqiad.wmnet [22:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1320.eqiad.wmnet [22:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:17] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1340.eqiad.wmnet'] ` an... 
[22:18:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1340.eqiad.wmnet [22:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:05] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:24:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1339.eqiad.wmnet with reason: REIMAGE [22:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1339.eqiad.wmnet with reason: REIMAGE [22:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:29] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1340.eqiad.wmnet [22:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:42:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1333.eqiad.wmnet with reason: REIMAGE
[22:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1333.eqiad.wmnet with reason: REIMAGE
[22:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1342.eqiad.wmnet with reason: REIMAGE
[22:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1342.eqiad.wmnet with reason: REIMAGE
[22:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:03:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1317.eqiad.wmnet with reason: REIMAGE
[23:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1317.eqiad.wmnet with reason: REIMAGE
[23:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:08:43] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:09:22] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1339.eqiad.wmnet'] ` an...
[23:09:46] (PS1) Dzahn: docker::engine: replace hiera_hash with lookup with hash merge behaviour [puppet] - https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953)
[23:09:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1339.eqiad.wmnet
[23:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:34] (PS2) Dzahn: docker::engine: replace hiera_hash with lookup with hash merge behaviour [puppet] - https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953)
[23:19:19] (PS1) Dzahn: ldap::config::labs: replace hiera_hash with lookup [puppet] - https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953)
[23:19:37] (PS2) Dzahn: ldap::config::labs: replace hiera_hash with lookup [puppet] - https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953)
[23:22:29] (PS3) Dzahn: docker::engine: replace hiera_hash with lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953)
[23:27:39] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1333.eqiad.wmnet'] ` an...
[23:29:24] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:32:19] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1342.eqiad.wmnet'] ` an...
[23:34:47] (PS1) Dzahn: profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953)
[23:36:30] (CR) jerkins-bot: [V: -1] profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[23:37:48] (PS3) Dzahn: ldap::config::labs: replace hiera_hash with lookup [puppet] - https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953)
[23:40:49] (PS4) Dzahn: docker::engine: replace hiera_hash with lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953)
[23:43:34] (PS2) Dzahn: profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953)
[23:45:22] (CR) jerkins-bot: [V: -1] profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[23:47:55] (CR) Dzahn: "15:45:16 wmf-style: total violations delta 2" [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[23:48:18] RECOVERY - SSH on analytics1058.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:48:53] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1317.eqiad.wmnet'] ` an...
[23:54:24] (PS3) Dzahn: profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953)
[23:55:31] (CR) Dzahn: [V: -1 C: -1] "still not working like this because of the calendar format" [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[23:56:00] (CR) jerkins-bot: [V: -1] profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[23:56:11] SRE, ops-eqiad, DC-Ops: update hostname labels on logstash103[345] & db11[51-76] - https://phabricator.wikimedia.org/T273922 (wiki_willy) a:Cmjohnson
[23:56:22] (CR) Dzahn: [V: -1 C: -1] "/usr/bin/systemd-analyze calendar *-*-1 0:0:00" [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[23:59:33] (PS4) Dzahn: profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953)