[00:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T0000). Please do the needful. [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:00:12] Nope [00:00:16] Nobody deploy anything please [00:26:46] RoanKattouw: hrm, your update seems to suggest that we are somehow triggering a rebuild of the cache, but with old values... [00:27:34] which is odd :) [00:28:09] Yes, we are and it is [00:29:16] My theory was that maybe there's a race condition, and that would have been plausible if all the bad servers had really small mtime deltas [00:29:42] And honestly a quarter-second race condition is already implausible. But 3 seconds, no way [00:30:08] So maybe it's the opcache? idk [00:30:14] there's an additional dimension in that the cache is only generated after a server is hit [00:30:26] a server is hit on a given subdomain [00:30:28] Yes [00:30:29] so it's tricky [00:30:41] but, yeah, in general, this seems to implicate opcache [00:30:45] Which is why the deltas are so high on some servers [00:30:47] which is...new to me [00:30:51] And that spread of deltas is fine [00:31:11] But it doesn't seem to correlate strongly enough with bad/good to really tell us all that much [00:32:00] The way the config cache gets rebuilt is: [00:32:07] 1) get the mtime of InitialiseSettings.php [00:32:22] 2) read the config cache JSON file, and look at the mtime value encoded in the data [00:32:57] 3) If #1 equals #2, use the cached values, otherwise regenerate [00:33:28] right, and this is a recent improvement by Krinkle to bypass a rare race-condition. 
[00:33:35] 4) To regenerate, obtain the new values using require_once "InitialiseSettings.php" and then get $settings [00:33:53] It's the require_once usage in #4 that I currently find the most suspicious [00:33:56] that is, it used to compare the mtime of the cache and the mtime of IS.php and regenerate the cache if it were older [00:34:12] if the cache were older, that is [00:34:13] Right, yeah I see why it was changed to what it is now [00:34:26] To cover the case where IS changes while the cache is being rebuilt [00:34:35] yeah [00:35:09] it does seem #4 is where something is going wrong looking at the files on disk as they stand now [00:37:30] Yeah [00:37:43] OK so I am going to resync to fix the current issue and destroy the evidence [00:37:53] Because I don't think I can get any farther by looking at the state of files on disk [00:38:16] Instead I'm going to comb through the CommonSettings.php code paths to see if InitialiseSettings.php could possibly be loaded before its mtime is checked [00:38:42] It uses require_once, so if other code also loads it, it won't be loaded again [00:38:57] right, +1 to that [00:39:08] And generally I'm going to comb through everything that happens before IS is loaded to see if anything jumps out [00:39:34] Right now all I have is "vague suspicions mumble require_once mumble opcache mumble mumble" [00:40:30] !log Deployment freeze lifted [00:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:21] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: resync (duration: 01m 07s) [00:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:33] here's a question: why did the sync get logged twice? 
[00:41:33] (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt DNS for elastic205[5-9],elastic2060 [dns] - 10https://gerrit.wikimedia.org/r/566623 (owner: 10Papaul) [00:41:56] https://tools.wmflabs.org/sal/log/AW_PYUSwfYQT6VcDwxjO and https://tools.wmflabs.org/sal/log/AW_PaNcg0fjmsHBa3J_X 9 seconds apart? [00:42:26] thcipriani: Those are different, help panel vs homepage [00:43:21] ah, also: 9 minutes apart :) [00:43:26] brain fart :) [00:44:13] (03CR) 10Dzahn: "can't say much about weblog and wtp is probably not going to exist anymore for another reinstall, will be folded into regular mw appserver" [puppet] - 10https://gerrit.wikimedia.org/r/566293 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [00:49:13] I just had an IRL chat with James and we considered making the following change: [00:49:18] Current: [00:49:31] (03CR) 10Dzahn: [C: 04-1] "seems to cause a duplicate declaration on maps1001:" [puppet] - 10https://gerrit.wikimedia.org/r/566490 (owner: 10Muehlenhoff) [00:49:36] InitialiseSettings.php: function wmfGetVariantSettings() { .... } [00:49:56] CommonSettings.php: require_once "InitialiseSettings.php"; $settings = wmfGetVariantSettings(); [00:50:02] We could change that to: [00:50:09] InitialiseSettings.php: return function () { ... } [00:50:27] CommonSettings.php: $settingsFunc = require "InitialiseSettings.php"; $settings = $settingsFunc(); [00:51:01] But that's a somewhat risky change and we have no idea whether it would help [00:56:03] Not risky, just tedious switch-over. [00:56:35] But yeah, no idea if it'd help (except that we could require instead of require_once, so re-setting if an opcache bug is triggered on the latter but not the former). [01:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T0100). 
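The four-step config cache check described in the discussion above (00:32:00 to 00:33:35) can be sketched roughly as follows. This is a minimal illustration in Python, not the production PHP code: the function name `load_settings`, the `generate` callback, and the JSON keys `mtime` and `settings` are all hypothetical stand-ins, and the atomic temp-file write is an assumption rather than something stated in the log.

```python
import json
import os
import tempfile


def load_settings(source_path, cache_path, generate):
    """Return settings, reusing the JSON cache only when the mtime stored
    inside it matches the current mtime of the settings source file.

    All names here are hypothetical; `generate` stands in for whatever
    loads the settings source and returns the computed values.
    """
    # Step 1: mtime of the settings source, read before anything is generated.
    source_mtime = os.stat(source_path).st_mtime_ns

    # Step 2: read the cache file and the mtime value encoded in its data.
    try:
        with open(cache_path, encoding="utf-8") as fh:
            cached = json.load(fh)
    except (OSError, ValueError):
        cached = None  # missing or unreadable cache counts as a miss

    # Step 3: an exact match means the cache was stamped for this source version.
    if (
        isinstance(cached, dict)
        and cached.get("mtime") == source_mtime
        and "settings" in cached
    ):
        return cached["settings"]

    # Step 4: regenerate and stamp the result with the mtime from step 1.
    settings = generate(source_path)
    payload = {"mtime": source_mtime, "settings": settings}

    # Assumption: write to a temp file and rename, so readers never see a
    # half-written cache file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp_path, cache_path)
    return settings
```

In this sketch the stamp only records which mtime was observed in step 1; it says nothing about which version of the source `generate` actually evaluated. If `generate` were to return values from an earlier copy of the file (for example because the file had already been loaded once, or a bytecode cache served an old version), those old values would be stored under the new mtime and trusted from then on. That would be consistent with the "rebuild of the cache, but with old values" symptom discussed above, though the log does not confirm this as the cause.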
[01:04:34] (03PS1) 10Dzahn: add IP for etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/566628 (https://phabricator.wikimedia.org/T224580) [01:06:20] (03PS1) 10Dzahn: add etherpad-new.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/566629 (https://phabricator.wikimedia.org/T224580) [01:08:49] (03PS1) 10Dzahn: install_server: add etherpad1002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/566630 [01:08:51] (03PS1) 10Dzahn: install_server: add etherpad1002 to netboot/partman [puppet] - 10https://gerrit.wikimedia.org/r/566631 (https://phabricator.wikimedia.org/T224580) [01:09:50] (03CR) 10jerkins-bot: [V: 04-1] install_server: add etherpad1002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/566630 (owner: 10Dzahn) [01:13:18] (03PS1) 10Dzahn: site: add etherpad1002 with role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/566633 [01:13:20] (03PS1) 10Dzahn: site: add etherpad role to etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/566634 (https://phabricator.wikimedia.org/T224580) [01:13:22] (03PS1) 10Dzahn: site: remove etherpad1001 [puppet] - 10https://gerrit.wikimedia.org/r/566635 (https://phabricator.wikimedia.org/T224580) [01:18:21] (03PS1) 10Dzahn: trafficserver/cache: add etherpad-new -> etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/566636 (https://phabricator.wikimedia.org/T224580) [01:18:23] (03PS1) 10Dzahn: trafficserver/cache: switch backend for etherpad to etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/566637 (https://phabricator.wikimedia.org/T224580) [01:20:40] (03PS1) 10Dzahn: remove etherpad-new.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/566638 (https://phabricator.wikimedia.org/T224580) [01:27:20] !log Deploying phabricator update tagged release/2020-01-23/1 [01:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:34] (03PS2) 10Dzahn: add IP for etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/566628 
(https://phabricator.wikimedia.org/T224580) [01:28:35] (03CR) 10Dzahn: [C: 03+2] add IP for etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/566628 (https://phabricator.wikimedia.org/T224580) (owner: 10Dzahn) [01:28:39] (03PS3) 10Dzahn: add IP for etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/566628 (https://phabricator.wikimedia.org/T224580) [01:33:34] (03PS2) 10Dzahn: install_server: add etherpad1002 to netboot/partman [puppet] - 10https://gerrit.wikimedia.org/r/566631 (https://phabricator.wikimedia.org/T224580) [01:33:59] (03PS3) 10Dzahn: install_server: add etherpad1002 to netboot/partman [puppet] - 10https://gerrit.wikimedia.org/r/566631 (https://phabricator.wikimedia.org/T224580) [01:34:28] (03CR) 10Legoktm: [WIP] toolforge: Port portgrabber related code to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [01:35:06] (03CR) 10Dzahn: [C: 03+2] install_server: add etherpad1002 to netboot/partman [puppet] - 10https://gerrit.wikimedia.org/r/566631 (https://phabricator.wikimedia.org/T224580) (owner: 10Dzahn) [01:35:15] (03PS3) 10Legoktm: toolforge: Port portgrabber related code to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) [01:37:39] !log Phabricator deployment completed with no apparent issues. 
[01:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:40] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/AbuseFilter/includes/AFComputedVariable.php: T243469 When no registration date is recorded, use 2008-01-15 (duration: 01m 08s) [01:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:43] T243469: Regression from 1.35.0-wmf.16: user_age abusefilter variable reports 0 for users with no registration time, rather than 11+ years - https://phabricator.wikimedia.org/T243469 [01:51:42] (03CR) 10Dzahn: [C: 03+2] site: add etherpad1002 with role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/566633 (owner: 10Dzahn) [02:02:47] 10Puppet, 10VPS-project-codesearch, 10Patch-For-Review: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (10Legoktm) 05Open→03Resolved a:03Legoktm Woohoo! Major thank you to @Dzahn for all of his help :) I've left codesearch4 shut down but not deleted just in case something goes wrong,... [02:12:35] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) 25 servers in row B racked and Netbox updated mw2310-mw2335 [02:19:04] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (10Papaul) [02:34:03] (03CR) 10BryanDavis: [C: 03+1] "Eyeball based review looks good. Testing via cherry pick in toolsbeta would probably be a good idea." [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [02:35:13] bd808: do you know why profile::toolforge::node::web isn't listed on https://tools.wmflabs.org/openstack-browser/puppetclass/ ? [02:36:24] maybe it's just not used directly... 
[02:36:33] legoktm: just a quick guess, but I would imagine we have another profile that includes it [02:38:21] * bd808 goes to look at prefix puppet to find out [02:41:13] legoktm: it is applied indirectly via role::wmcs::toolforge::grid::web::lighttpd and profile::toolforge::grid::node::web::generic [02:41:38] aha, thank you [02:42:21] without PuppetDB, the data we have on which puppet code applies to which Cloud VPS instances is pretty limited [02:43:00] and as far as I know we still haven't figured out any good way to have PuppetDB in a multi-tenant environment. [02:50:50] bd808: can you add me to the toolsbeta project? [02:51:04] yes I can! [02:53:54] thanks [02:53:58] I'm in :) [02:54:08] is there a dummy tool to test with? [02:59:07] found a few [03:04:18] bd808: do I need to be listed as an admin somewhere else? I'm getting a password prompt on sudo [03:05:57] legoktm: oh, I bet there is a custom sudoers group. Let me fix that [03:07:45] perfect, thank you! [05:08:32] 10Operations, 10Phabricator, 10SRE-tools, 10Technical-Debt: Update Puppet repo code that uses deprecated maniphest.update/.createtask/.query Conduit API - https://phabricator.wikimedia.org/T159045 (10DannyS712) a:03Volans [05:11:02] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10DannyS712) a:03mmodell [05:55:37] !log Compress some tables on db1124:3318, this might generate lag on s8 labs - T232446 [05:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:41] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [05:56:59] (03PS1) 10Marostegui: db1103: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/566649 (https://phabricator.wikimedia.org/T239453) [05:58:13] (03CR) 10Marostegui: [C: 03+2] db1103: Enable notifications [puppet] 
- 10https://gerrit.wikimedia.org/r/566649 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [05:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3314 - T239453', diff saved to https://phabricator.wikimedia.org/P10247 and previous config saved to /var/cache/conftool/dbconfig/20200123-055919-marostegui.json [05:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:24] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314 - T239453', diff saved to https://phabricator.wikimedia.org/P10248 and previous config saved to /var/cache/conftool/dbconfig/20200123-060308-marostegui.json [06:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:10] !log Remove partitions from db1097:3314 - T239453 [06:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:13] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:06:41] (03PS1) 10Marostegui: db1097: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/566650 (https://phabricator.wikimedia.org/T239453) [06:07:33] (03CR) 10Marostegui: [C: 03+2] db1097: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/566650 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [06:16:25] (03PS1) 10BryanDavis: toollabs: add Buster support to toollabs::apt_pinning [puppet] - 10https://gerrit.wikimedia.org/r/566651 [07:44:50] (03PS4) 10Legoktm: toolforge: Port portgrabber related code to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) [07:50:02] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: remove useless inclusion of lvs::configuration [puppet] - 10https://gerrit.wikimedia.org/r/566461 [07:57:10] 10Operations, 10LDAP-Access-Requests, 
10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Joe) [07:57:43] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Joe) @Dzahn can we please ensure this procedure is finished before next week? [08:00:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: remove useless inclusion of lvs::configuration [puppet] - 10https://gerrit.wikimedia.org/r/566461 (owner: 10Giuseppe Lavagetto) [08:05:04] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:41] (03CR) 10Muehlenhoff: [C: 03+1] contint: use package_from_component, stop using docker class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:07:46] (03CR) 10Muehlenhoff: [C: 04-1] contint: use package_from_component, stop using docker class [puppet] - 10https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:09:45] ^ arturo: the alert on sodium is the new openstack mirror sync [08:23:22] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10MoritzMuehlenhoff) p:05Triage→03High [08:23:31] 10Operations, 10Wikimedia-Etherpad, 10serviceops: vm request for etherpad1002 - https://phabricator.wikimedia.org/T243475 (10MoritzMuehlenhoff) p:05Triage→03Normal [08:24:07] 10Operations, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10MoritzMuehlenhoff) 
p:05Triage→03Normal This task is from 2018, is that still an issue? [08:34:17] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) > it should be in the raw logs Not sure what this refers to. If it is zotero logs, note that those don't exist. We had to turn them off as they were in a really bad for... [08:42:29] (03PS1) 10Muehlenhoff: Remove tor relay profile/role [puppet] - 10https://gerrit.wikimedia.org/r/566687 (https://phabricator.wikimedia.org/T243288) [08:44:28] (03PS1) 10Marostegui: mariadb: Productionize es2025 as es5 codfw slave [puppet] - 10https://gerrit.wikimedia.org/r/566688 (https://phabricator.wikimedia.org/T243052) [08:45:21] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es2025 as es5 codfw slave [puppet] - 10https://gerrit.wikimedia.org/r/566688 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [08:45:56] !log Stop mysql on es2024 to "clone" es2025 - T243052 [08:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:04] T243052: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052 [08:47:23] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Jclark-ctr) [08:48:48] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Host racked bios, ip , and password set. Needs dns server ip asset tag rack switch port mw1349 10.65.1.24 WMF5291 D1... 
[09:01:07] (03CR) 10Muehlenhoff: [C: 03+2] Remove tor relay profile/role [puppet] - 10https://gerrit.wikimedia.org/r/566687 (https://phabricator.wikimedia.org/T243288) (owner: 10Muehlenhoff) [09:02:26] (03CR) 10Alexandros Kosiaris: "This is pretty cool, I think we only need to populate the configuration file now and we should be good to go. Comments inline on how to do" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [09:03:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, pending the discussion on service.name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [09:04:08] (03CR) 10Marostegui: WIP - Introduce profile::mariadb::misc::analytics (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [09:13:08] !log installing xen updates (only pulled in via deps, otherwise unused) [09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:21] (03CR) 10Ema: [C: 03+2] cache: remove backend-specific VCL files [puppet] - 10https://gerrit.wikimedia.org/r/566554 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [09:22:18] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:28] (03PS1) 10Arturo Borrero Gonzalez: mirrors: fix openstack rsync command [puppet] - 10https://gerrit.wikimedia.org/r/566694 (https://phabricator.wikimedia.org/T238820) [09:27:10] (03CR) 10jerkins-bot: [V: 04-1] mirrors: fix openstack rsync command [puppet] - 10https://gerrit.wikimedia.org/r/566694 (https://phabricator.wikimedia.org/T238820) (owner: 10Arturo Borrero Gonzalez) [09:31:15] 10Operations, 10Goal: 2020 Q3 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10Marostegui) 
[09:36:32] (03PS2) 10Arturo Borrero Gonzalez: mirrors: fix openstack rsync command [puppet] - 10https://gerrit.wikimedia.org/r/566694 (https://phabricator.wikimedia.org/T238820) [09:40:36] (03PS1) 10Ema: cache: remove app_directors and app_def_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/566696 (https://phabricator.wikimedia.org/T241239) [09:44:14] (03CR) 10Ema: "Functional NOOP, a few newlines and comments removed from rendered VCL files: https://puppet-compiler.wmflabs.org/compiler1002/20528/" [puppet] - 10https://gerrit.wikimedia.org/r/566696 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [09:46:49] (03PS2) 10Vgutierrez: prometheus: Remove varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566565 (https://phabricator.wikimedia.org/T241239) [09:46:51] (03PS2) 10Vgutierrez: prometheus: Clean up varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566566 (https://phabricator.wikimedia.org/T241239) [09:47:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/566694 (https://phabricator.wikimedia.org/T238820) (owner: 10Arturo Borrero Gonzalez) [09:47:47] (03CR) 10Vgutierrez: [C: 03+1] cache: remove app_directors and app_def_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/566696 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [09:48:30] (03PS1) 10Ema: cache: remove unused parameter 'cache_route' [puppet] - 10https://gerrit.wikimedia.org/r/566700 (https://phabricator.wikimedia.org/T241239) [09:48:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] mirrors: fix openstack rsync command [puppet] - 10https://gerrit.wikimedia.org/r/566694 (https://phabricator.wikimedia.org/T238820) (owner: 10Arturo Borrero Gonzalez) [09:49:16] (03CR) 10Ema: [C: 03+1] prometheus: Clean up varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566566 (https://phabricator.wikimedia.org/T241239) (owner: 10Vgutierrez) [09:49:23] (03CR) 10Ema: [C: 03+1] 
prometheus: Remove varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566565 (https://phabricator.wikimedia.org/T241239) (owner: 10Vgutierrez) [09:49:37] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Remove varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566565 (https://phabricator.wikimedia.org/T241239) (owner: 10Vgutierrez) [09:50:29] (03CR) 10Ema: [C: 03+2] cache: remove app_directors and app_def_be_opts [puppet] - 10https://gerrit.wikimedia.org/r/566696 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [09:50:59] (03PS1) 10Ladsgroup: Revert "Revert "Revert "Revert "Set useEntitySourceBasedFederation to true for Wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566701 [09:51:10] (03PS2) 10Ladsgroup: Revert "Revert "Revert "Revert "Set useEntitySourceBasedFederation to true for Wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566701 [09:58:55] (03PS2) 10Ema: cache: remove unused parameter 'cache_route' [puppet] - 10https://gerrit.wikimedia.org/r/566700 (https://phabricator.wikimedia.org/T241239) [10:00:58] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1002/20531/" [puppet] - 10https://gerrit.wikimedia.org/r/566700 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:01:09] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Clean up varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566566 (https://phabricator.wikimedia.org/T241239) (owner: 10Vgutierrez) [10:01:20] (03PS3) 10Vgutierrez: prometheus: Clean up varnish-backend cluster config [puppet] - 10https://gerrit.wikimedia.org/r/566566 (https://phabricator.wikimedia.org/T241239) [10:01:50] (03CR) 10Vgutierrez: [C: 03+1] cache: remove unused parameter 'cache_route' [puppet] - 10https://gerrit.wikimedia.org/r/566700 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:02:12] (03CR) 10Ema: [C: 03+2] cache: remove unused parameter 'cache_route' [puppet] - 
10https://gerrit.wikimedia.org/r/566700 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:07:14] (03PS1) 10Vgutierrez: ATS: Move to a hiera parameter the ats-tls conftool service name [puppet] - 10https://gerrit.wikimedia.org/r/566702 (https://phabricator.wikimedia.org/T242093) [10:09:10] (03CR) 10Vgutierrez: "pcc reports almost a NOOP: https://puppet-compiler.wmflabs.org/compiler1003/20532/" [puppet] - 10https://gerrit.wikimedia.org/r/566702 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:13:38] (03CR) 10Ema: [C: 03+1] ATS: Move to a hiera parameter the ats-tls conftool service name [puppet] - 10https://gerrit.wikimedia.org/r/566702 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:15:46] (03CR) 10Vgutierrez: [C: 03+2] ATS: Move to a hiera parameter the ats-tls conftool service name [puppet] - 10https://gerrit.wikimedia.org/r/566702 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:15:47] 10Operations, 10Phabricator, 10SRE-tools, 10Technical-Debt: Update Puppet repo code that uses deprecated maniphest.update/.createtask/.query Conduit API - https://phabricator.wikimedia.org/T159045 (10Aklapper) [10:16:40] (03PS1) 10Vgutierrez: install_server: Reimage cp4026 as buster [puppet] - 10https://gerrit.wikimedia.org/r/566703 (https://phabricator.wikimedia.org/T242093) [10:18:38] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Aklapper) [10:20:09] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/20533/" [puppet] - 10https://gerrit.wikimedia.org/r/566703 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:25:15] (03CR) 10Ema: [C: 03+1] install_server: Reimage cp4026 as buster [puppet] - 10https://gerrit.wikimedia.org/r/566703 
(https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:25:45] !log depooling and reimaging cp4026 as buster - T242093 [10:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:49] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:26:30] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage cp4026 as buster [puppet] - 10https://gerrit.wikimedia.org/r/566703 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:26:42] \o/ [10:28:39] (03PS1) 10Arturo Borrero Gonzalez: toolforge: refactor elastic role/profile into modern layout [puppet] - 10https://gerrit.wikimedia.org/r/566704 (https://phabricator.wikimedia.org/T236606) [10:30:55] (03CR) 10Arturo Borrero Gonzalez: "counter attack: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/566704" [puppet] - 10https://gerrit.wikimedia.org/r/566651 (owner: 10BryanDavis) [10:31:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4026.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[10:36:54] (03PS1) 10Ema: cache: remove 'inst' VCL template variable [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) [10:37:36] (03CR) 10jerkins-bot: [V: 04-1] cache: remove 'inst' VCL template variable [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:43:14] (03PS2) 10Ema: cache: remove 'inst' VCL template variable [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) [10:44:05] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4026.ulsfo.wmnet'] ` [10:44:24] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4026.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:44:27] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4026.ulsfo.wmnet'] ` [10:44:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4026.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[10:48:37] (03PS3) 10Ema: cache: remove 'inst' VCL template variable [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) [10:52:57] (03CR) 10Ema: "pcc seems satisfactory: https://puppet-compiler.wmflabs.org/compiler1002/20535/" [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:01:31] (03PS1) 10Ema: cache: remove be_runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/566706 (https://phabricator.wikimedia.org/T241239) [11:01:50] (03CR) 10Vgutierrez: [C: 03+1] cache: remove 'inst' VCL template variable [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:04:27] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1002/20536/" [puppet] - 10https://gerrit.wikimedia.org/r/566706 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:05:15] (03CR) 10Ema: [C: 03+2] cache: remove 'inst' VCL template variable [puppet] - 10https://gerrit.wikimedia.org/r/566705 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:05:16] (03CR) 10Vgutierrez: [C: 03+1] cache: remove be_runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/566706 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:07:15] (03CR) 10Ema: [C: 03+2] cache: remove be_runtime_params [puppet] - 10https://gerrit.wikimedia.org/r/566706 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:12:40] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) [11:12:41] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) I just noticed that for some reason setting DEBUG_LEVEL: 0 for zotero no longer works however. 
[11:16:50] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4026.ulsfo.wmnet'] ` [11:16:51] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4026.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202001231116_vgutie... [11:20:22] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) >>! In T243444#5825779, @akosiaris wrote: >> it should be in the raw logs > > Not sure what this refers to. If it is zotero logs, note that those don't exist. We had to tur... [11:20:42] (03PS1) 10Ema: cache: move fe_runtime_params out of varnish::common [puppet] - 10https://gerrit.wikimedia.org/r/566707 (https://phabricator.wikimedia.org/T241239) [11:25:04] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1003/20537/" [puppet] - 10https://gerrit.wikimedia.org/r/566707 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:27:29] (03CR) 10Vgutierrez: [C: 03+1] cache: move fe_runtime_params out of varnish::common [puppet] - 10https://gerrit.wikimedia.org/r/566707 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:30:39] (03PS1) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [11:31:12] 10Operations, 10Citoid: Use log level warn in citoid whenever zotero is unresponsive - https://phabricator.wikimedia.org/T243504 (10Mvolz) [11:37:56] !log updating order in resolve search list https://gerrit.wikimedia.org/r/c/operations/puppet/+/566567 [11:37:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:01] (03CR) 10Jbond: [C: 03+2] base::resolv: ensure the server domain name is always first in the search list [puppet] - 10https://gerrit.wikimedia.org/r/566567 (owner: 10Jbond) [11:39:23] (03CR) 10Ema: [C: 03+2] cache: move fe_runtime_params out of varnish::common [puppet] - 10https://gerrit.wikimedia.org/r/566707 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:43:15] (03PS1) 10Jbond: puppetmaster2003: enable puppet master for live traffic [puppet] - 10https://gerrit.wikimedia.org/r/566709 (https://phabricator.wikimedia.org/T239732) [11:45:11] 10Operations, 10Traffic: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10Vgutierrez) [11:45:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/566709 (https://phabricator.wikimedia.org/T239732) (owner: 10Jbond) [11:45:59] 10Operations, 10Traffic: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10Vgutierrez) p:05Triage→03Normal [11:46:19] (03CR) 10Jbond: [C: 03+2] puppetmaster2003: enable puppet master for live traffic [puppet] - 10https://gerrit.wikimedia.org/r/566709 (https://phabricator.wikimedia.org/T239732) (owner: 10Jbond) [11:49:20] (03PS1) 10Vgutierrez: install_server: Switch cp4026 to the standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/566711 (https://phabricator.wikimedia.org/T243506) [11:53:17] (03CR) 10Muehlenhoff: [C: 03+1] install_server: Switch cp4026 to the standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/566711 (https://phabricator.wikimedia.org/T243506) (owner: 10Vgutierrez) [11:53:54] (03CR) 10Vgutierrez: [C: 03+2] install_server: Switch cp4026 to the standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/566711 (https://phabricator.wikimedia.org/T243506) (owner: 10Vgutierrez) [11:54:25] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - 
https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4026.ulsfo.wmnet'] ` [11:56:16] 10Operations, 10Traffic, 10Patch-For-Review: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4026.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-re... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1200). [12:00:04] kart_ and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] * Urbanecm around [12:00:13] o/ [12:00:23] o/ [12:00:51] * kart_ is here. [12:01:03] Amir1: can you deploy my patch too? :) [12:01:03] Amir1: want to SWAT, or should I? [12:01:30] Or Urbanecm if you can.. [12:01:30] nah, I do SWAT today [12:01:50] Okay. Ping me once done, please :) [12:02:15] sure [12:02:57] (03PS2) 10Ladsgroup: Move CX out of beta for af, is, lv and ne WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566123 (https://phabricator.wikimedia.org/T242011) (owner: 10KartikMistry) [12:03:08] thanks [12:03:17] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566123 (https://phabricator.wikimedia.org/T242011) (owner: 10KartikMistry) [12:03:57] (03Merged) 10jenkins-bot: Move CX out of beta for af, is, lv and ne WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566123 (https://phabricator.wikimedia.org/T242011) (owner: 10KartikMistry) [12:04:54] kart_: it's live in mwdebug1001 [12:05:21] OK. Testing. 
[12:06:15] (03PS3) 10Ladsgroup: Revert "Revert "Revert "Revert "Set useEntitySourceBasedFederation to true for Wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566701 [12:07:27] Amir1: looks good. Please deploy! [12:08:04] okie dokie [12:08:49] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566701 (owner: 10Ladsgroup) [12:08:54] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:566123|Move CX out of beta for af, is, lv and ne WPs (T242011 T242012 T242014 T242016)]] (duration: 01m 08s) [12:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:01] T242014: Enable Content Translation in Latvian Wikipedia as a default tool - https://phabricator.wikimedia.org/T242014 [12:09:01] T242016: Enable Content Translation in Nepali Wikipedia as a default tool - https://phabricator.wikimedia.org/T242016 [12:09:01] T242011: Enable Content Translation in Afrikaans Wikipedia as a default tool - https://phabricator.wikimedia.org/T242011 [12:09:01] T242012: Enable Content Translation in Icelandic Wikipedia as a default tool - https://phabricator.wikimedia.org/T242012 [12:09:43] Thanks Amir1 ! [12:09:45] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "Set useEntitySourceBasedFederation to true for Wikidata"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566701 (owner: 10Ladsgroup) [12:09:59] doing twice for the issue of IS.php cache [12:10:42] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:566123|Move CX out of beta for af, is, lv and ne WPs (T242011 T242012 T242014 T242016)]] (duration: 01m 05s) [12:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:56] Oh, OK! 
[12:13:05] kart_: it's live now ^_^ [12:14:02] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:566701|Set useEntitySourceBasedFederation to true for Wikidata (T241972)]] (duration: 01m 06s) [12:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:05] T241972: wmgUseEntitySourceBasedFederation true for Wikidata.org - https://phabricator.wikimedia.org/T241972 [12:15:20] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:566701|Set useEntitySourceBasedFederation to true for Wikidata (T241972)]] (duration: 01m 04s) [12:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:37] Amir1: still deploying? [12:19:42] yup [12:19:50] ok [12:20:01] Urbanecm: go forward, I need to stare at logs for a bit [12:20:44] Okay [12:21:49] Thanks again Amir1 [12:22:06] (03PS10) 10Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [12:22:08] kart_: don't mention it :) [12:22:12] (03CR) 10Urbanecm: [C: 03+2] Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [12:23:05] (03Merged) 10jenkins-bot: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [12:26:59] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Good start, but the script needs some improvements, and we need to figure out how to prevent this from firing during a deployment." 
(0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:28:36] (03PS1) 10Ladsgroup: Set EntitySourceBasedFederation true for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566716 (https://phabricator.wikimedia.org/T243395) [12:28:39] Urbanecm: let me know once you're done [12:31:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers (T230103) (duration: 01m 06s) [12:31:07] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) >>! In T243444#5825779, @akosiaris wrote: >> it should be in the raw logs > > Not sure what this refers to. If it is zotero logs, note that those don't exist. We had to tur... [12:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:09] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [12:31:33] !log Deploying hotfix for T243479, restarting php7.3-fpm on phab1003 [12:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:36] T243479: Phab Global Search broken: Unhandled Exception: Argument 1 passed to PhabricatorHandleQuery::withPHIDs() must be of the type array, object given - https://phabricator.wikimedia.org/T243479 [12:33:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers; fixing broken cache (T230103) (duration: 01m 04s) [12:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:36] Amir1: deploying done, I'll run a few scripts now [12:34:03] cool, I deploy something quickly [12:34:32] ah, no, forgot to sync two files :D [12:34:36] CS.php and flaggedrevs.php is not safe [12:35:18] !log mwscript 
renameRestrictions.php --wiki=ckbwiki 'autopatrol' 'editautopatrolprotected' (T230103) [12:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:10] Urbanecm: let me know once you're done [12:37:34] thanks [12:39:08] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers (2/3; T230103) (duration: 01m 08s) [12:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:11] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [12:40:58] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: 0c2fb70: Use editautopatrolprotected right for pages protected for autopatrollers (3/3; T230103) (duration: 01m 05s) [12:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] Amir1: done for good now [12:42:45] cool [12:43:01] (03CR) 10Ladsgroup: [C: 03+2] Set EntitySourceBasedFederation true for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566716 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:43:55] (03Merged) 10jenkins-bot: Set EntitySourceBasedFederation true for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566716 (https://phabricator.wikimedia.org/T243395) (owner: 10Ladsgroup) [12:44:06] !log mwscript renameRestrictions.php --wiki=etwiki 'autopatrol' 'editautopatrolprotected' (T230103) [12:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:14] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [12:44:32] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4565 MB (3% inode=85%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [12:44:49] !log 
mwscript renameRestrictions.php --wiki=hewiki 'autopatrol' 'editautopatrolprotected' (T230103) [12:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] !log Run renameRestrictions.php 'autopatrol' 'editautopatrolprotected' for all Serbian wikis (T230103) [12:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:47] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:566716|Set EntitySourceBasedFederation true for testwiki (T243395)]] (duration: 01m 05s) [12:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:50] T243395: wmgUseEntitySourceBasedFederation true for Wikimedia clients (all sites) - https://phabricator.wikimedia.org/T243395 [12:49:00] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:566716|Set EntitySourceBasedFederation true for testwiki (T243395)]] (duration: 01m 06s) [12:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:23] !log EU SWAT is done [12:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:03] (03Abandoned) 10Urbanecm: Use editeditorprotected for protecting pages for editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529046 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [12:52:50] (03PS1) 10Urbanecm: Use editeditorprotected for protecting pages for editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566722 (https://phabricator.wikimedia.org/T230103) [12:56:55] (03PS1) 10Vgutierrez: install_server: Fix syntax error on an-conf* partman configuration [puppet] - 10https://gerrit.wikimedia.org/r/566723 [12:58:26] (03CR) 10Muehlenhoff: install_server: Fix syntax error on an-conf* partman configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566723 (owner: 10Vgutierrez) [12:59:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: 
decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10faidon) #Traffic team, ping? This task has been open since August last year and as I was just saying on IRC, cp1008 is a constant outlier in all of... [12:59:48] (03PS2) 10Vgutierrez: install_server: Fix syntax error on an-{conf,coord,master}* netboot [puppet] - 10https://gerrit.wikimedia.org/r/566723 [13:01:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/566723 (owner: 10Vgutierrez) [13:01:30] (03CR) 10Vgutierrez: [C: 03+2] install_server: Fix syntax error on an-{conf,coord,master}* netboot [puppet] - 10https://gerrit.wikimedia.org/r/566723 (owner: 10Vgutierrez) [13:04:03] (03PS1) 10Vgutierrez: Revert "install_server: Switch cp4026 to the standard partman recipe" [puppet] - 10https://gerrit.wikimedia.org/r/566727 [13:04:25] 10Operations, 10Traffic: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4026.ulsfo.wmnet'] ` [13:05:39] (03CR) 10Vgutierrez: [C: 03+2] Revert "install_server: Switch cp4026 to the standard partman recipe" [puppet] - 10https://gerrit.wikimedia.org/r/566727 (owner: 10Vgutierrez) [13:05:53] (03PS2) 10Vgutierrez: Revert "install_server: Switch cp4026 to the standard partman recipe" [puppet] - 10https://gerrit.wikimedia.org/r/566727 [13:08:00] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) `-l app=citoid` as that's the value for the app label, not citoid-production. You can have a look at all the labels that are attached to pods with either https://gerri... 
[13:10:07] 10Operations, 10Traffic: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4026.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202001231309_vgu... [13:19:12] 10Operations, 10Traffic: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez The final culprit were 3 syntax errors on netboot.cfg as part of https://gerrit.wikimedia.org/r/#/q/Id93d599c6ef0efc5caa2d8cccc83773644bd7ec6 as s... [13:19:14] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [13:19:20] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:12] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) >>! In T243444#5826294, @akosiaris wrote: > I just noticed that for some reason setting DEBUG_LEVEL: 0 for zotero no longer works however. Turns out those aren't the l... [13:32:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:30] 10Operations, 10vm-requests: Add a second CPU to debmonitor hosts - https://phabricator.wikimedia.org/T241046 (10MoritzMuehlenhoff) 05Open→03Resolved Both debmonitor instances now have two CPUs. 
[13:41:38] 10Operations, 10Traffic: buster installation issues on cache nodes - https://phabricator.wikimedia.org/T243506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` and were **ALL** successful. [13:44:46] \o/ [13:50:20] 10Operations, 10Design-Research, 10Domains, 10Traffic: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (10Aklapper) @Dendelele: Can you please reply to the last three comments? Thanks. [13:53:50] 10Operations, 10Traffic, 10Performance Issue: Current performance issues - https://phabricator.wikimedia.org/T242228 (10Aklapper) >>! In T242228#5786329, @Joe wrote: > An incident report will be published later on wikitech at https://wikitech.wikimedia.org/wiki/Incident_documentation This is https://wikitec... [13:55:58] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) >>! In T243444#5826662, @akosiaris wrote: > `-l app=citoid` as that's the value for the app label, not citoid-production. > > You can have a look at all the labels that ar... [13:58:25] (03PS1) 10Arturo Borrero Gonzalez: realm: introduce global variable $wmcs_deployment [puppet] - 10https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) [13:58:27] (03PS1) 10Arturo Borrero Gonzalez: wmcs: instance: introduce per-deployment openstack profiles [puppet] - 10https://gerrit.wikimedia.org/r/566736 (https://phabricator.wikimedia.org/T242607) [13:58:49] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) > Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana, warn level or higher. There's no debug level or message... 
[13:59:28] (03CR) 10jerkins-bot: [V: 04-1] wmcs: instance: introduce per-deployment openstack profiles [puppet] - 10https://gerrit.wikimedia.org/r/566736 (https://phabricator.wikimedia.org/T242607) (owner: 10Arturo Borrero Gonzalez) [14:00:09] (03PS2) 10Arturo Borrero Gonzalez: wmcs: instance: introduce per-deployment openstack profiles [puppet] - 10https://gerrit.wikimedia.org/r/566736 (https://phabricator.wikimedia.org/T242607) [14:05:36] (03PS1) 10Ema: cache: remove unused parameter 'varnish_version' [puppet] - 10https://gerrit.wikimedia.org/r/566738 [14:14:47] (03CR) 10Ema: [C: 03+2] cache: remove unused parameter 'varnish_version' [puppet] - 10https://gerrit.wikimedia.org/r/566738 (owner: 10Ema) [14:17:43] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 108039760 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:17:44] !log Remove wikiadmin2 user from codfw x1 hosts - T243512 [14:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:48] T243512: Clean up wikiadmin2 user from core hosts - https://phabricator.wikimedia.org/T243512 [14:19:31] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 304 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:28:14] (03PS2) 10Giuseppe Lavagetto: service::catalog: fix host header for check_http* monitoring [puppet] - 10https://gerrit.wikimedia.org/r/566250 [14:28:15] <_joe_> alea iacta est [14:28:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::catalog: fix host header for check_http* monitoring [puppet] - 10https://gerrit.wikimedia.org/r/566250 (owner: 10Giuseppe Lavagetto) [14:31:44] <_joe_> running puppet on icinga1001 right now [14:39:54] !log vgutierrez@puppetmaster1001 conftool action : set/weight=1; selector: service=ats-tls,name=cp4026.ulsfo.wmnet [14:39:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:53] 10Operations, 10Goal: 2020 Q3 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10Joe) [14:43:57] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Joe) [14:54:59] (03PS1) 10Vgutierrez: conftool: Add ats-tls service to text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/566750 (https://phabricator.wikimedia.org/T238625) [14:56:17] (03PS17) 10Ottomata: New eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) [14:59:05] (03PS2) 10Vgutierrez: conftool: Add ats-tls service to text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/566750 (https://phabricator.wikimedia.org/T238625) [14:59:10] (03PS1) 10Muehlenhoff: Switch the default installer to Buster [puppet] - 10https://gerrit.wikimedia.org/r/566751 [14:59:37] (03PS1) 10Ema: cache: merge misc-common VCL into misc-frontend [puppet] - 10https://gerrit.wikimedia.org/r/566752 (https://phabricator.wikimedia.org/T241239) [14:59:51] (03CR) 10Ottomata: [C: 03+2] New eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [15:00:18] (03CR) 10Ema: [C: 03+1] conftool: Add ats-tls service to text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/566750 (https://phabricator.wikimedia.org/T238625) (owner: 10Vgutierrez) [15:01:20] (03CR) 10Vgutierrez: [C: 03+2] conftool: Add ats-tls service to text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/566750 (https://phabricator.wikimedia.org/T238625) (owner: 10Vgutierrez) [15:04:00] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1002/20539/" [puppet] - 10https://gerrit.wikimedia.org/r/566752 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:06:44] !log 
vgutierrez@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4026.ulsfo.wmnet,service=nginx [15:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:44] (03CR) 10CDanis: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/566751 (owner: 10Muehlenhoff) [15:08:54] (03CR) 10Vgutierrez: [C: 03+1] cache: merge misc-common VCL into misc-frontend [puppet] - 10https://gerrit.wikimedia.org/r/566752 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:09:47] (03CR) 10CDanis: [C: 03+1] "lgtm, a little curious about migration plan, but not very curious ;)" [puppet] - 10https://gerrit.wikimedia.org/r/566303 (owner: 10Filippo Giunchedi) [15:09:59] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10jijiki) Gutter pool has been initially tested in Beta and looks well. To make this test work, we deployed the a config to mcrouter (attached at the bottom) running o... 
[15:10:47] (03CR) 10CDanis: [C: 03+1] install_server: switch ms-fe to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:13:21] (03PS2) 10Muehlenhoff: Switch the default installer to Buster [puppet] - 10https://gerrit.wikimedia.org/r/566751 [15:14:58] (03PS1) 10Vgutierrez: lvs: Switch from nginx to ats-tls service name on text & upload [puppet] - 10https://gerrit.wikimedia.org/r/566756 (https://phabricator.wikimedia.org/T238625) [15:17:21] (03CR) 10Ema: [C: 03+1] lvs: Switch from nginx to ats-tls service name on text & upload [puppet] - 10https://gerrit.wikimedia.org/r/566756 (https://phabricator.wikimedia.org/T238625) (owner: 10Vgutierrez) [15:20:01] (03PS3) 10Muehlenhoff: Switch the default installer to Buster [puppet] - 10https://gerrit.wikimedia.org/r/566751 [15:20:38] (03CR) 10Effie Mouzeli: [C: 03+1] "> can't say much about weblog and wtp is probably not going to exist" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566293 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:20:45] (03PS2) 10Vgutierrez: lvs,ATS: Switch from nginx to ats-tls service name on text & upload [puppet] - 10https://gerrit.wikimedia.org/r/566756 (https://phabricator.wikimedia.org/T238625) [15:21:00] PROBLEM - Check systemd state on debmonitor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:13] (03CR) 10Ema: [C: 03+1] lvs,ATS: Switch from nginx to ats-tls service name on text & upload [puppet] - 10https://gerrit.wikimedia.org/r/566756 (https://phabricator.wikimedia.org/T238625) (owner: 10Vgutierrez) [15:22:25] !log mask uwsgi.service on debmonitor2001 T222874 [15:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:28] T222874: service::uwsgi should mask uwsgi.service - https://phabricator.wikimedia.org/T222874 [15:22:50] RECOVERY - Check systemd state on debmonitor2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:28] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10jijiki) [15:26:25] 10Operations, 10Core Platform Team, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Joe) [15:30:30] (03CR) 10Vgutierrez: [C: 03+2] lvs,ATS: Switch from nginx to ats-tls service name on text & upload [puppet] - 10https://gerrit.wikimedia.org/r/566756 (https://phabricator.wikimedia.org/T238625) (owner: 10Vgutierrez) [15:32:26] !log restarting secondary LVSs - T236120 T238625 [15:32:27] (03PS1) 10Alexandros Kosiaris: eventstreams: Add k8s tokens [labs/private] - 10https://gerrit.wikimedia.org/r/566769 (https://phabricator.wikimedia.org/T238658) [15:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:31] T238625: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 [15:32:31] T236120: Get rid of nginx puppetization for cache upload - https://phabricator.wikimedia.org/T236120 [15:32:49] (03PS1) 10Alexandros Kosiaris: eventstreams: Create kubernetes token [puppet] - 10https://gerrit.wikimedia.org/r/566770 (https://phabricator.wikimedia.org/T238658) [15:32:51] (03PS1) 
10Alexandros Kosiaris: eventstreams: Add kubernetes hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/566771 (https://phabricator.wikimedia.org/T238658) [15:32:53] (03PS1) 10Alexandros Kosiaris: DNM: eventstreams: switch lvs to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/566772 (https://phabricator.wikimedia.org/T238658) [15:32:55] (03PS1) 10Alexandros Kosiaris: eventstreams: Remove all conftool data [puppet] - 10https://gerrit.wikimedia.org/r/566773 (https://phabricator.wikimedia.org/T238658) [15:34:09] (03PS2) 10Ema: cache: merge misc-common VCL into misc-frontend [puppet] - 10https://gerrit.wikimedia.org/r/566752 (https://phabricator.wikimedia.org/T241239) [15:34:11] (03PS1) 10Ema: cache: merge upload-common VCL into upload-frontend [puppet] - 10https://gerrit.wikimedia.org/r/566774 (https://phabricator.wikimedia.org/T241239) [15:34:13] (03CR) 10Muehlenhoff: [C: 03+2] Switch the default installer to Buster [puppet] - 10https://gerrit.wikimedia.org/r/566751 (owner: 10Muehlenhoff) [15:34:40] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventstreams: Add k8s tokens [labs/private] - 10https://gerrit.wikimedia.org/r/566769 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [15:36:02] (03CR) 10Ema: [C: 03+2] cache: merge misc-common VCL into misc-frontend [puppet] - 10https://gerrit.wikimedia.org/r/566752 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:36:54] akosiaris: OK to puppet-merge your change? 
[15:40:08] (03PS4) 10Cparle: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) [15:40:18] akosiaris: the dummy tokens in labs/private, I mean [15:40:25] (03PS1) 10Alexandros Kosiaris: admin: Move blubberoid into a more DRY format [deployment-charts] - 10https://gerrit.wikimedia.org/r/566775 [15:40:38] ema: yes please do [15:40:59] done, ty [15:41:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Move blubberoid into a more DRY format [deployment-charts] - 10https://gerrit.wikimedia.org/r/566775 (owner: 10Alexandros Kosiaris) [15:41:03] ty [15:41:24] (03Merged) 10jenkins-bot: admin: Move blubberoid into a more DRY format [deployment-charts] - 10https://gerrit.wikimedia.org/r/566775 (owner: 10Alexandros Kosiaris) [15:44:00] (03PS1) 10Muehlenhoff: Fix two errors in DHCP config spotted during default OS flip [puppet] - 10https://gerrit.wikimedia.org/r/566776 [15:44:42] (03CR) 10Cparle: [C: 03+2] Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) (owner: 10Cparle) [15:45:54] !log restarting high-traffic1 && high-traffic2 primary LVSs - T236120 T238625 [15:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:58] T238625: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 [15:45:59] T236120: Get rid of nginx puppetization for cache upload - https://phabricator.wikimedia.org/T236120 [15:46:00] (03CR) 10Muehlenhoff: [C: 03+2] Fix two errors in DHCP config spotted during default OS flip [puppet] - 10https://gerrit.wikimedia.org/r/566776 (owner: 10Muehlenhoff) [15:46:05] (03Merged) 10jenkins-bot: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) (owner: 
10Cparle) [15:47:24] Going to deploy this quickly https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/566721 [15:50:32] (03PS2) 10Ema: cache: merge upload-common VCL into upload-frontend [puppet] - 10https://gerrit.wikimedia.org/r/566774 (https://phabricator.wikimedia.org/T241239) [15:52:51] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1001/20541/" [puppet] - 10https://gerrit.wikimedia.org/r/566774 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:54:53] (03PS1) 10RLazarus: Shorten TTL for conftool SRV entries, preparatory for main cluster switchover. [dns] - 10https://gerrit.wikimedia.org/r/566778 [15:58:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Shorten TTL for conftool SRV entries, preparatory for main cluster switchover. [dns] - 10https://gerrit.wikimedia.org/r/566778 (owner: 10RLazarus) [16:00:11] !log Starting etcd main cluster switchover from codfw to eqiad [16:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:29] (03CR) 10RLazarus: [C: 03+2] Shorten TTL for conftool SRV entries, preparatory for main cluster switchover. [dns] - 10https://gerrit.wikimedia.org/r/566778 (owner: 10RLazarus) [16:01:58] rlazarus: err.. ok /o\ [16:02:22] cormacparle__: have you deployed these changes? ^ [16:02:34] I don't see any deployment happening [16:02:40] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Wikibase/data-access/src/EntitySourceDefinitions.php: [[gerrit:566721|EntitySourceDefitions::getEntityTypeToSourceMapping fix for sub entities (T242415 T214557)]] (duration: 01m 08s) [16:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:47] T214557: Allow accessing Wikibase entities from multiple (Wikibase) databases - https://phabricator.wikimedia.org/T214557 [16:02:47] T242415: EntitySourceDefinitions to use 'sub-entity-type' entity type definition information - https://phabricator.wikimedia.org/T242415 [16:02:56] which changes? 
[16:03:10] vgutierrez: whoop sorry -- I haven't done anything but that TTL change, waiting here [16:03:39] rlazarus: give me a sec till I finish with the primary LVSs restarts plz :) [16:03:48] yep for sure, thanks for the ping [16:03:50] Amir1: which changes? [16:03:50] "(CR) Cparle: [C: +2] Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - https://gerrit.wikimedia.org/r/565615 (https://phabricator.wikimedia.org/T241072) (owner: Cparle)" [16:03:55] ah shit [16:03:57] yes [16:04:22] I forgot it was an operations change, and would get deployed instantly [16:04:46] PROBLEM - PyBal connections to etcd on lvs2002 is CRITICAL: CRITICAL: 5 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [16:04:46] do I need to revert? [16:04:56] hmmm [16:05:01] Amir1: ^ [16:05:07] it's not deployed, it's just going to sit there blocking other deployments [16:05:16] cormacparle__: yes please. Is there anything else [16:05:20] let me check [16:05:29] no, nothing else [16:05:45] Amir1: just make a revert, then +2 it myself? [16:06:11] yes, I will rebase deploy1001 [16:06:15] thanks! 
[16:08:58] RECOVERY - PyBal connections to etcd on lvs2002 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [16:09:10] cool :) [16:09:27] (PS1) Cparle: Revert "Re-enable delayed new upload jobs for MachineVision extension" [mediawiki-config] - https://gerrit.wikimedia.org/r/566780 (https://phabricator.wikimedia.org/T241072) [16:09:58] (PS1) Alexandros Kosiaris: admin: DRY all environments [deployment-charts] - https://gerrit.wikimedia.org/r/566781 [16:10:19] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (RobH) a:RobH→Volans [16:10:54] Amir1: revert is here https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/566780/ [16:10:57] rlazarus: only eqiad missing... [16:11:11] Thanks [16:11:21] (CR) Ladsgroup: [C: +2] Revert "Re-enable delayed new upload jobs for MachineVision extension" [mediawiki-config] - https://gerrit.wikimedia.org/r/566780 (https://phabricator.wikimedia.org/T241072) (owner: Cparle) [16:11:35] <_joe_> Amir1: uh, today? [16:11:39] I rebase it on deploy1001 [16:11:46] <_joe_> oh just revert, ok [16:11:51] <_joe_> then I agree :P [16:12:14] _joe_: the main patch got merged by mistake [16:12:21] <_joe_> ack [16:12:27] <_joe_> it broke the jobqueue last time [16:12:43] (Merged) jenkins-bot: Revert "Re-enable delayed new upload jobs for MachineVision extension" [mediawiki-config] - https://gerrit.wikimedia.org/r/566780 (https://phabricator.wikimedia.org/T241072) (owner: Cparle) [16:12:43] the machine vision thing? [16:13:11] Petr has (hopefully) fixed the reason it broke the jobqueue [16:13:25] rebased on deploy1001, all good now [16:13:39] <_joe_> cormacparle__: he just told me, we're in a meeting :D [16:14:00] but yeah - didn't mean to re-enable it just now [16:14:27] rlazarus: I'm done, thanks :D [16:14:47] thank you!
sorry for the scare [16:17:23] dodging meetings so I'm going to wait and start the etcd main cluster switchover about 1700 UTC (45m from now), etcd will be read-only for ~30m, maybe less [16:18:26] (PS1) Vgutierrez: conftool: Remove nginx service from text and upload clusters [puppet] - https://gerrit.wikimedia.org/r/566785 (https://phabricator.wikimedia.org/T238625) [16:18:40] rlazarus: so may I proceed with the last step of my nginx cleaning? ^^ [16:18:46] Operations, serviceops: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (Jdforrester-WMF) Boldly unlinking this from the parent as it can't block it if that's Resolved. [16:18:50] vgutierrez: go for it [16:18:50] Operations, ops-eqiad, serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [16:18:53] Operations, serviceops: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (Jdforrester-WMF) [16:18:56] Operations, serviceops, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), Patch-For-Review, Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (Jdforrester-WMF) [16:21:58] (CR) Ema: [C: +1] conftool: Remove nginx service from text and upload clusters [puppet] - https://gerrit.wikimedia.org/r/566785 (https://phabricator.wikimedia.org/T238625) (owner: Vgutierrez) [16:22:12] (CR) Vgutierrez: [C: +2] conftool: Remove nginx service from text and upload clusters [puppet] - https://gerrit.wikimedia.org/r/566785 (https://phabricator.wikimedia.org/T238625) (owner: Vgutierrez) [16:22:24] <_joe_> rlazarus: or we could skip that meeting and do the migration [16:22:31] (PS1) Ema: cache: merge text-common VCL into text-frontend [puppet] - https://gerrit.wikimedia.org/r/566787 (https://phabricator.wikimedia.org/T241239) [16:22:34] <_joe_> I
don't want to be around after 7 pm :) [16:24:33] sure, also valid -- we could start whenever it's okay for vgutierrez in that case [16:24:34] (PS1) Jhedden: wmcs-cold-migrate: update glance for v2 client [puppet] - https://gerrit.wikimedia.org/r/566788 [16:24:44] rlazarus: go for it [16:24:47] or if it gets too late in the day, no worries, I'll back out the TTL change and we can try tomorrow [16:25:21] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez) [16:25:23] Operations, Traffic, Patch-For-Review: Get rid of nginx puppetization for cache upload - https://phabricator.wikimedia.org/T236120 (Vgutierrez) Open→Resolved [16:25:29] (PS1) CDanis: add API key for scraping of LibreNMS's API by Icinga [labs/private] - https://gerrit.wikimedia.org/r/566789 (https://phabricator.wikimedia.org/T224888) [16:25:38] Operations, Traffic, Patch-For-Review: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 (Vgutierrez) Open→Resolved a:Vgutierrez [16:25:46] (CR) Ema: "pcc: https://puppet-compiler.wmflabs.org/compiler1001/20544/" [puppet] - https://gerrit.wikimedia.org/r/566787 (https://phabricator.wikimedia.org/T241239) (owner: Ema) [16:25:47] Operations, fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (Jgreen) [16:26:18] !log pooling cp4026 running buster - T242093 [16:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:21] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:27:10] !log depool cp4032 and reimage as buster - T242093 [16:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] (PS4) Ayounsi: Initial flowspec support [homer/public] - https://gerrit.wikimedia.org/r/562505
(https://phabricator.wikimedia.org/T243482) [16:28:38] (CR) Vgutierrez: [C: +1] cache: merge upload-common VCL into upload-frontend [puppet] - https://gerrit.wikimedia.org/r/566774 (https://phabricator.wikimedia.org/T241239) (owner: Ema) [16:29:07] (CR) Vgutierrez: [C: +1] cache: merge text-common VCL into text-frontend [puppet] - https://gerrit.wikimedia.org/r/566787 (https://phabricator.wikimedia.org/T241239) (owner: Ema) [16:30:05] Operations, Wikimedia-Mailing-lists, Patch-For-Review, User-RhinosF1: Allow Cloud mailing list to be indexed - https://phabricator.wikimedia.org/T242520 (RhinosF1) a:RhinosF1→None Nothing for me to do here [16:30:55] (PS1) Vgutierrez: install_server: Reimage cp4032 as buster [puppet] - https://gerrit.wikimedia.org/r/566792 (https://phabricator.wikimedia.org/T242093) [16:32:14] (CR) Ema: [C: +1] install_server: Reimage cp4032 as buster [puppet] - https://gerrit.wikimedia.org/r/566792 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez) [16:32:25] (CR) Vgutierrez: [C: +2] install_server: Reimage cp4032 as buster [puppet] - https://gerrit.wikimedia.org/r/566792 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez) [16:35:05] (PS1) Alexandros Kosiaris: eventstreams: Add the namespace and calico rules [deployment-charts] - https://gerrit.wikimedia.org/r/566793 (https://phabricator.wikimedia.org/T238658) [16:35:38] (CR) Alexandros Kosiaris: [C: +2] admin: DRY all environments [deployment-charts] - https://gerrit.wikimedia.org/r/566781 (owner: Alexandros Kosiaris) [16:36:00] (Merged) jenkins-bot: admin: DRY all environments [deployment-charts] - https://gerrit.wikimedia.org/r/566781 (owner: Alexandros Kosiaris) [16:36:12] (CR) Alexandros Kosiaris: [C: +2] "Double checked, there are a few minor diffs to limitranges but those are ok.
I 'll double diff on the clusters as well" [deployment-charts] - https://gerrit.wikimedia.org/r/566781 (owner: Alexandros Kosiaris) [16:36:16] Operations, Traffic, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp4032.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [16:36:31] (PS1) Ladsgroup: Enable EntitySourceBasedFederation for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/566795 (https://phabricator.wikimedia.org/T243395) [16:41:12] (PS1) Ema: cache: remove/update cache_text backend VTC [puppet] - https://gerrit.wikimedia.org/r/566797 (https://phabricator.wikimedia.org/T241239) [16:41:23] (PS2) Alexandros Kosiaris: eventstreams: Add the namespace and calico rules [deployment-charts] - https://gerrit.wikimedia.org/r/566793 (https://phabricator.wikimedia.org/T238658) [16:41:25] (PS1) Alexandros Kosiaris: admin: Remove graphoid remnants [deployment-charts] - https://gerrit.wikimedia.org/r/566798 [16:42:00] (CR) Alexandros Kosiaris: [C: +2] admin: Remove graphoid remnants [deployment-charts] - https://gerrit.wikimedia.org/r/566798 (owner: Alexandros Kosiaris) [16:42:19] (Merged) jenkins-bot: admin: Remove graphoid remnants [deployment-charts] - https://gerrit.wikimedia.org/r/566798 (owner: Alexandros Kosiaris) [16:45:57] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs-cold-migrate: update glance for v2 client [puppet] - https://gerrit.wikimedia.org/r/566788 (owner: Jhedden) [16:50:23] (CR) Alexandros Kosiaris: [C: +2] eventstreams: Add the namespace and calico rules [deployment-charts] - https://gerrit.wikimedia.org/r/566793 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [16:51:15] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release
'calico-policy-controller' . [16:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:16] (PS1) Alexandros Kosiaris: evenstreams: Add forgotten admin/ values files [deployment-charts] - https://gerrit.wikimedia.org/r/566803 (https://phabricator.wikimedia.org/T238658) [16:56:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:00] (CR) Alexandros Kosiaris: [C: +2] evenstreams: Add forgotten admin/ values files [deployment-charts] - https://gerrit.wikimedia.org/r/566803 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [16:57:17] (Merged) jenkins-bot: evenstreams: Add forgotten admin/ values files [deployment-charts] - https://gerrit.wikimedia.org/r/566803 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [16:58:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:17] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [16:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:46] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[17:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:43] Operations, Traffic, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4032.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4032.ulsfo.wmnet'] ` [17:02:49] uh... [17:03:42] somehow the reimage script was unable to check that the boot device got back to normal [17:05:38] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [17:10:26] (CR) Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler1003/20547/ for PCC. looks ok" [puppet] - https://gerrit.wikimedia.org/r/566770 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [17:10:28] (CR) Alexandros Kosiaris: [C: +2] eventstreams: Create kubernetes token [puppet] - https://gerrit.wikimedia.org/r/566770 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [17:13:46] vgutierrez: getting back to it, still clear on your end? [17:13:53] yep [17:14:02] thanks! [17:15:12] (PS1) RLazarus: etcd main cluster switchover: Set codfw to read-only. [puppet] - https://gerrit.wikimedia.org/r/566807 [17:17:08] (CR) Giuseppe Lavagetto: [C: +1] etcd main cluster switchover: Set codfw to read-only. [puppet] - https://gerrit.wikimedia.org/r/566807 (owner: RLazarus) [17:20:04] (CR) RLazarus: [C: +2] etcd main cluster switchover: Set codfw to read-only. [puppet] - https://gerrit.wikimedia.org/r/566807 (owner: RLazarus) [17:26:05] (PS1) RLazarus: etcd main cluster switchover: Disable replication in eqiad (conf1005).
[puppet] - https://gerrit.wikimedia.org/r/566810 [17:26:38] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (RobH) a:Volans→None [17:27:26] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (faidon) a:RobH (@Volans is not in Traffic), but regardles... judging from @BBlack comments before the flurry of Gerrit commits, it seems like I... [17:28:40] Operations, Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (Vgutierrez) on buster, systemd is not quite happy with trafficserver using /var/run: `Jan 23 17:15:40 cp4032 systemd[1]: /lib/systemd/system/trafficserver.service:8: PIDFile= references path below leg... [17:28:53] (CR) Giuseppe Lavagetto: [C: +1] etcd main cluster switchover: Disable replication in eqiad (conf1005).
[puppet] - https://gerrit.wikimedia.org/r/566810 (owner: RLazarus) [17:29:02] (PS2) Jforrester: [trwiki] Tweak unblocking logo versions [mediawiki-config] - https://gerrit.wikimedia.org/r/565401 (https://phabricator.wikimedia.org/T242977) [17:29:08] jouncebot: next [17:29:08] In 0 hour(s) and 30 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1800) [17:29:24] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) [17:29:29] (CR) Jforrester: [C: +2] [trwiki] Tweak unblocking logo versions [mediawiki-config] - https://gerrit.wikimedia.org/r/565401 (https://phabricator.wikimedia.org/T242977) (owner: Jforrester) [17:30:30] (Merged) jenkins-bot: [trwiki] Tweak unblocking logo versions [mediawiki-config] - https://gerrit.wikimedia.org/r/565401 (https://phabricator.wikimedia.org/T242977) (owner: Jforrester) [17:33:01] !log jforrester@deploy1001 Synchronized static/images/project-logos: [trwiki] Tweak logo versions T242977 (duration: 01m 07s) [17:33:02] !log Poweroff db2085:3311 and db2085:3318 for maintenance - T243148 [17:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:04] T242977: Change Turkish Wikipedia logo to reflect occasion of unblocking - https://phabricator.wikimedia.org/T242977 [17:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:08] T243148: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148 [17:33:24] Operations, Scap, serviceops: Make canary wait time configurable - https://phabricator.wikimedia.org/T217924 (jijiki) If it is a lot of work to limit when `--canary-wait-time` is available, we could do a graceful rollout, by asking deployers, via `utils.ask/utils.confirm`), to try this flag on, say,...
[17:34:01] (CR) RLazarus: [C: +2] etcd main cluster switchover: Disable replication in eqiad (conf1005). [puppet] - https://gerrit.wikimedia.org/r/566810 (owner: RLazarus) [17:35:02] <_joe_> jouncebot: next [17:35:02] In 0 hour(s) and 24 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1800) [17:35:22] (PS4) Jforrester: Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - https://gerrit.wikimedia.org/r/416730 (owner: Matthias Mullie) [17:36:34] (CR) Jforrester: [C: +1] Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - https://gerrit.wikimedia.org/r/416730 (owner: Matthias Mullie) [17:37:49] (PS1) Ayounsi: Juniper to Netbox import script [software/netbox-extras] - https://gerrit.wikimedia.org/r/566812 [17:38:16] (CR) jerkins-bot: [V: -1] Juniper to Netbox import script [software/netbox-extras] - https://gerrit.wikimedia.org/r/566812 (owner: Ayounsi) [17:39:04] (Abandoned) Ayounsi: Juniper to Netbox import script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/507217 (owner: Ayounsi) [17:41:38] (CR) Ayounsi: "You can find an example run in af-netbox, with mr1-ulsfo, cr3-ulsfo and asw2-ulsfo." [software/netbox-extras] - https://gerrit.wikimedia.org/r/566812 (owner: Ayounsi) [17:41:51] PROBLEM - etcdmirror-conftool-codfw-wmnet service on conf1005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-codfw-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:42:42] _joe_: doing something on etcd? ^^^ [17:42:45] (PS1) RLazarus: etcd main cluster switchover: Move read-write DNS entries to eqiad.
[dns] - https://gerrit.wikimedia.org/r/566813 [17:42:54] <_joe_> volans: yes [17:42:56] <_joe_> it's downtimed [17:43:06] didn't look like it :-P [17:43:45] (PS1) Vgutierrez: Replace /var/run with /run [debs/trafficserver] - https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093) [17:43:50] (CR) Giuseppe Lavagetto: [C: +1] etcd main cluster switchover: Move read-write DNS entries to eqiad. [dns] - https://gerrit.wikimedia.org/r/566813 (owner: RLazarus) [17:43:56] (CR) jerkins-bot: [V: -1] Replace /var/run with /run [debs/trafficserver] - https://gerrit.wikimedia.org/r/566814 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez) [17:44:17] <_joe_> volans: well the paging alert is [17:44:26] (CR) RLazarus: [C: +2] etcd main cluster switchover: Move read-write DNS entries to eqiad. [dns] - https://gerrit.wikimedia.org/r/566813 (owner: RLazarus) [17:45:15] (PS1) Alexandros Kosiaris: admin: DRY environments by using a common one [deployment-charts] - https://gerrit.wikimedia.org/r/566816 [17:48:20] (PS1) Jhedden: Revert "openstack: depool cloudvirt1007 for ceph testing" [puppet] - https://gerrit.wikimedia.org/r/566817 [17:49:19] PROBLEM - Check systemd state on conf1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:42] ^^ rlazarus [17:50:00] thanks, I believe expected but _joe_ to confirm [17:52:18] <_joe_> it's expected [17:52:24] <_joe_> lemme clear all those for you people [17:52:58] <_joe_> !log running systemctl reset-failed on conf1005 to clear useless alerts [17:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:03] ack [17:53:15] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [17:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:49] hrmm, i really wish that output into chat what host i was running it on [17:53:58] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:02] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp1008.wikimedia.org` - cp1008.wikimedia.org (**PASS**... [17:54:23] robh: +1 and have the -P option for logging to task, right? [17:54:40] oh, -t logs to task [17:54:45] RECOVERY - Check systemd state on conf1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:46] yea, that one [17:55:03] honestly it should output the entire line if it makes it easier, include host and task in the SAL [17:55:28] I mean, we like a terse admin log but that is too terse to be useful. [17:56:11] PROBLEM - Host db2085.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:56:43] mutante: heh, see dcops chat they are already on it ;D [17:56:46] (updating decom script) [17:56:52] the db2085.mgmt is not me.
[17:56:54] ^expected [17:57:07] (PS1) RLazarus: etcd main cluster switchover: Enable replication in codfw (conf2002). [puppet] - https://gerrit.wikimedia.org/r/566819 [17:57:15] robh: nice :) [17:57:17] marostegui: ack [17:57:20] thx =] [17:57:52] (CR) Giuseppe Lavagetto: [C: +2] etcd main cluster switchover: Enable replication in codfw (conf2002). [puppet] - https://gerrit.wikimedia.org/r/566819 (owner: RLazarus) [17:59:12] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [17:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:42] <_joe_> jouncebot: next [17:59:42] In 0 hour(s) and 0 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1800) [17:59:46] <_joe_> ahem [17:59:48] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:50] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp1073.eqiad.wmnet` - cp1073.eqiad.wmnet (**PASS**)... [17:59:53] <_joe_> can I ask to stall deployments? [18:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1800).
[18:01:17] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:53] (PS2) Jforrester: Remove partial blocks banner from all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300) (owner: Tchanders) [18:02:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:03] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp1074.eqiad.wmnet` - cp1074.eqiad.wmnet (**PASS**)... [18:02:07] RECOVERY - Host db2085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.32 ms [18:03:09] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:47] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:50] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp1099.eqiad.wmnet` - cp1099.eqiad.wmnet (**PASS**)...
[18:03:53] (CR) Volans: [C: +2] "> Patch Set 2: Code-Review+1" [software/spicerack] - https://gerrit.wikimedia.org/r/566054 (https://phabricator.wikimedia.org/T231068) (owner: Volans) [18:04:04] (CR) Volans: "Optional post-merge nit inline" (1 comment) [software] - https://gerrit.wikimedia.org/r/566528 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui) [18:04:14] (PS1) Giuseppe Lavagetto: etcd: fix port for eqiad SRV records [dns] - https://gerrit.wikimedia.org/r/566821 [18:05:12] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:39] FYI the conftool sync stage of puppet-merge is currently broken due to a snag in the etcd main cluster move, _joe_ and I are working on it [18:05:55] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:58] <_joe_> I will also ask all deployers to stall deployments [18:05:59] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `cp1071.eqiad.wmnet` - cp1071.eqiad.wmnet (**PASS**)...
[18:06:02] <_joe_> until we are ok [18:06:08] (PS1) Joal: Update labstore mediawiki-history readme file [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) [18:06:15] (CR) RLazarus: [C: +1] etcd: fix port for eqiad SRV records [dns] - https://gerrit.wikimedia.org/r/566821 (owner: Giuseppe Lavagetto) [18:06:19] (CR) Giuseppe Lavagetto: [C: +2] etcd: fix port for eqiad SRV records [dns] - https://gerrit.wikimedia.org/r/566821 (owner: Giuseppe Lavagetto) [18:06:25] (CR) Jhedden: [C: +2] Revert "openstack: depool cloudvirt1007 for ceph testing" [puppet] - https://gerrit.wikimedia.org/r/566817 (owner: Jhedden) [18:07:10] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (RobH) [18:07:26] <_joe_> jeh: can you wait 15 seconds before running puppet-merge? [18:07:42] _joe_: sorry, I just ran that :( [18:07:47] Operations, SRE-tools, Patch-For-Review: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (Volans) As I've seen some efforts to use black also in other part of the org, ideally it would be nice to have a single way to set it up: * black configuration (line length, qu... [18:07:49] <_joe_> it did work well? [18:08:07] Operations, ops-eqiad, DC-Ops, decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (RobH) a:RobH→Jclark-ctr @Jclark-ctr: These hosts are ready for the on-site wipe steps. I've also left the puppet and dns updates, so duri...
[18:08:12] (PS1) Dzahn: icinga: let Hugh Nowlan run commands on all hosts and services [puppet] - https://gerrit.wikimedia.org/r/566823 (https://phabricator.wikimedia.org/T242309) [18:08:16] (CR) Effie Mouzeli: "Just added some suggestions:)" (9 comments) [puppet] - https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: Hnowlan) [18:08:26] _joe_: yeah, I believe so [18:08:31] <_joe_> ok great [18:08:34] <_joe_> you just tested my fix [18:08:40] <_joe_> :P [18:09:01] \o/ [18:09:10] (Merged) jenkins-bot: ganeti: add initial support for gnt-instance [software/spicerack] - https://gerrit.wikimedia.org/r/566054 (https://phabricator.wikimedia.org/T231068) (owner: Volans) [18:09:54] (PS3) Tchanders: Remove unused config for partial blocks banner [mediawiki-config] - https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300) [18:10:42] is it ok to write again on etcd? _joe_ rlazarus? [18:10:47] not yet [18:10:51] ack [18:10:57] <_joe_> vgutierrez: in a few minutes [18:12:24] (PS1) RLazarus: etcd main cluster switchover: Set eqiad to read-write. [puppet] - https://gerrit.wikimedia.org/r/566826 [18:13:31] <_joe_> rlazarus: running the compiler just in case [18:13:37] <_joe_> but +1 I think [18:13:38] 👍 [18:13:54] <_joe_> I just don't trust my previous work, not your patch per-se [18:13:59] haha understood [18:14:46] <_joe_> https://puppet-compiler.wmflabs.org/compiler1002/20548/conf1005.eqiad.wmnet/ seems correct [18:15:07] agreed, looks like the opposite of what I saw in codfw [18:15:09] (CR) Giuseppe Lavagetto: [C: +2] etcd main cluster switchover: Set eqiad to read-write. [puppet] - https://gerrit.wikimedia.org/r/566826 (owner: RLazarus) [18:15:12] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
[18:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:17] <_joe_> rlazarus: merge it! [18:15:24] o7 [18:15:31] <_joe_> ottomata: I asked to abstain from releases [18:15:44] <_joe_> for another couple minutes [18:16:02] <_joe_> the time that rlazarus and I run puppet on the eqiad etcd cluster [18:16:17] <_joe_> (staging is ok, just don't move to the prod clusters) [18:19:44] (PS1) Ottomata: Fix chart name in eventstreams staging helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/566827 (https://phabricator.wikimedia.org/T238658) [18:20:07] (PS2) Ottomata: Fix chart name in eventstreams staging helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/566827 (https://phabricator.wikimedia.org/T238658) [18:20:33] (CR) Ottomata: [C: +2] Fix chart name in eventstreams staging helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/566827 (https://phabricator.wikimedia.org/T238658) (owner: Ottomata) [18:22:09] <_joe_> vgutierrez: you are now free [18:22:13] awesome [18:22:33] !log pooling cp4032 running buster - T242093 [18:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:36] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [18:23:42] (PS1) Ottomata: Bump eventstreams chart version and repackage [deployment-charts] - https://gerrit.wikimedia.org/r/566828 (https://phabricator.wikimedia.org/T238658) [18:23:54] oh _joe_ sorry! [18:24:04] ok yeah i'm just doing staging [18:24:06] normal puppet-merge ok, right?
[18:24:08] to do some benchmarking there [18:24:23] mutante: puppet-merge should be back to normal yep, ping me/_joe_ if any lingering problems [18:24:26] (03CR) 10Dzahn: [C: 03+2] icinga: let Hugh Nowlan run commands on all hosts and services [puppet] - 10https://gerrit.wikimedia.org/r/566823 (https://phabricator.wikimedia.org/T242309) (owner: 10Dzahn) [18:24:43] vgutierrez: etcd is read-write again, you're all clear, thanks again [18:24:54] (03CR) 10Ottomata: [C: 03+2] Bump eventstreams chart version and repackage [deployment-charts] - 10https://gerrit.wikimedia.org/r/566828 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [18:24:58] oops missed that in scrollback, good :) [18:25:03] :) [18:25:26] rlazarus: no issues on puppet-merge, thanks [18:25:48] it ended with conftool removing stale objects as normal [18:26:55] jouncebot: next [18:26:55] In 0 hour(s) and 33 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1900) [18:27:26] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [18:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:21] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [18:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:34] (03PS1) 10RLazarus: Restore 5M TTL for conftool SRV entries, following main cluster switchover. [dns] - 10https://gerrit.wikimedia.org/r/566830 [18:34:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Restore 5M TTL for conftool SRV entries, following main cluster switchover. [dns] - 10https://gerrit.wikimedia.org/r/566830 (owner: 10RLazarus) [18:34:14] (03CR) 10RLazarus: [C: 03+2] Restore 5M TTL for conftool SRV entries, following main cluster switchover. 
[dns] - 10https://gerrit.wikimedia.org/r/566830 (owner: 10RLazarus) [18:34:58] (03PS1) 10Ottomata: eventstreams - define tls.telemetry in values.yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/566831 (https://phabricator.wikimedia.org/T238658) [18:35:24] !log etcd main cluster switchover complete, eqiad is now read-write [18:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:52] (03CR) 10Ottomata: [C: 03+2] eventstreams - define tls.telemetry in values.yaml files [deployment-charts] - 10https://gerrit.wikimedia.org/r/566831 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [18:36:08] _joe_: Can I do a regular scap MW config deploy now? [18:36:24] <_joe_> James_F: yes, we're all clear [18:36:30] Excellent. Thank you! [18:36:30] (03CR) 10Dmaza: [C: 03+1] Remove unused config for partial blocks banner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [18:36:57] (03CR) 10Jforrester: [C: 03+2] Remove unused config for partial blocks banner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [18:37:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10JHedden) [18:37:53] (03Merged) 10jenkins-bot: Remove unused config for partial blocks banner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [18:40:04] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgWikimediaMessagesPartialBlockBanner, never read T240300 (duration: 01m 06s) [18:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:08] T240300: Introduce a temporary banner on Special:Block to inform users about upcoming partial blocks 
deploy - https://phabricator.wikimedia.org/T240300 [18:42:08] (03PS4) 10Jforrester: Document why we have duplicate false value for wmgEnableGeoData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565751 (https://phabricator.wikimedia.org/T183549) (owner: 10Ammarpad) [18:42:13] (03CR) 10Jforrester: [C: 03+2] Document why we have duplicate false value for wmgEnableGeoData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565751 (https://phabricator.wikimedia.org/T183549) (owner: 10Ammarpad) [18:43:10] (03Merged) 10jenkins-bot: Document why we have duplicate false value for wmgEnableGeoData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565751 (https://phabricator.wikimedia.org/T183549) (owner: 10Ammarpad) [18:44:32] (03PS1) 10Ottomata: eventstreams - Bump image version to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/566835 (https://phabricator.wikimedia.org/T238658) [18:45:57] (03CR) 10Ottomata: [C: 03+2] eventstreams - Bump image version to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/566835 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [18:46:05] (03PS2) 10Jforrester: Disable MobileFrontend Mainpage special casing on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562025 (https://phabricator.wikimedia.org/T241888) (owner: 10Ammarpad) [18:47:24] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [18:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:38] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) @elukey @jijiki I am going as fast as I can ...there are several racking tasks that need to be completed. John updated switch ports this morning. [18:47:52] (03CR) 10Jforrester: "Hey, this looks good to go. Were you going to schedule it for a SWAT?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/562025 (https://phabricator.wikimedia.org/T241888) (owner: 10Ammarpad) [18:47:59] (03CR) 10Jforrester: [C: 03+1] "Hey, this looks good to go. Were you going to schedule it for a SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416730 (owner: 10Matthias Mullie) [18:48:55] (03PS12) 10Jforrester: [lawiki] Add minerva custom logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [18:50:05] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) not hitting the installer...still working on them [18:50:53] (03CR) 10Jforrester: "Hey there. Thank you for these logos, but they are too big." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [18:50:59] (03CR) 10Jforrester: [C: 04-1] Add logos for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [18:51:06] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Cmjohnson) @marostegui I do not know yet, there are several racking/setup tasks that I am trying to get through. I need to check with @Jclark-ctr and se... [18:51:09] (03PS1) 10Ottomata: eventstreams - fix missing '-' in '---', and '.' in template reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/566837 (https://phabricator.wikimedia.org/T238658) [18:51:35] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (10RobH) I've not seen @bblack in IRC since posting the above comment, I suspect due to pre-all-hands-rush. 
We have SRE meeting time set aside during all hands, so I'll sync up with @bbla... [18:51:45] (03PS3) 10Jforrester: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) (owner: 10Zoranzoki21) [18:51:51] (03CR) 10Jforrester: "Hey, this looks good to go. Were you going to schedule it for a SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) (owner: 10Zoranzoki21) [18:52:08] (03PS3) 10Jforrester: Add wordmark for etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565549 (https://phabricator.wikimedia.org/T230379) (owner: 10Pikne) [18:52:33] (03CR) 10Ottomata: [C: 03+2] eventstreams - fix missing '-' in '---', and '.' in template reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/566837 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [18:54:08] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [18:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:20] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [18:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:37] robh: ^ and this too :) [18:58:45] that's what i was thinking about earlier, actually [18:58:58] yeah, need the hostname echo in there [18:59:38] !log ganeti1003 - creating new VM etherpad1002.eqiad.wmnet with 1GB RAM and 10GB disk, row C, private link (T243475) [18:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:41] T243475: vm request for etherpad1002 - https://phabricator.wikimedia.org/T243475 [19:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T1900). 
[19:00:04] matthiasmullie, cparle, and stephanebisson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:18] I can SWAT today! [19:00:36] great [19:01:04] Urbanecm: I'll do mine myself, but stephanebisson's patches can go first - I'm waiting on something ATM [19:01:15] okay [19:01:26] buenos dias (el) gato de casa [19:02:03] mutante: ? [19:02:14] 10Operations, 10Wikimedia-Etherpad, 10serviceops: vm request for etherpad1002 - https://phabricator.wikimedia.org/T243475 (10Dzahn) a:03Dzahn [19:02:48] (03PS2) 10Urbanecm: Add *.eso.org to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566560 (https://phabricator.wikimedia.org/T243423) [19:02:54] (03PS3) 10Urbanecm: Add *.eso.org to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566560 (https://phabricator.wikimedia.org/T243423) [19:02:59] (03CR) 10Urbanecm: [C: 03+2] Add *.eso.org to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566560 (https://phabricator.wikimedia.org/T243423) (owner: 10Urbanecm) [19:03:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) firmware updated and bios. 
@jhedden can your team test to see if it will fail still [19:03:08] Urbanecm: just greeting Hauskatze, i'll stop distracting now [19:03:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [19:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:23] gotit, thanks mutante [19:04:08] (03Merged) 10jenkins-bot: Add *.eso.org to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566560 (https://phabricator.wikimedia.org/T243423) (owner: 10Urbanecm) [19:04:34] (03PS2) 10Urbanecm: Use editeditorprotected for protecting pages for editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566722 (https://phabricator.wikimedia.org/T230103) [19:05:02] mutante: Gracias, ser genéticamente modificado :P [19:06:01] (03PS3) 10Urbanecm: Use editeditorprotected for protecting pages for editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566722 (https://phabricator.wikimedia.org/T230103) [19:06:11] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566722 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [19:06:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 629b5fc: Add *.eso.org to the wgCopyUploadsDomains (T243423) (duration: 01m 06s) [19:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:15] T243423: Add www.eso.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T243423 [19:07:04] (03Merged) 10jenkins-bot: Use editeditorprotected for protecting pages for editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566722 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [19:07:30] stephanebisson: just in case: is it intended to do just .16? 
[19:07:47] Urbanecm: yes [19:07:51] (03PS2) 10Dzahn: install_server: add etherpad1002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/566630 [19:07:56] ok [19:08:23] stephanebisson: just landed at mwdebug1001 [19:08:29] Urbanecm: It's been tested in betalabs but it is not testable in prod because the feature is not enabled yet. This is just to be ready to enable the feature before the next train. [19:08:34] You can sync [19:08:38] Okay, syncing then [19:09:40] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:15] (03PS2) 10Joal: Update labstore mediawiki-history readme file [puppet] - 10https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) [19:10:31] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/WikimediaMessages/extension.json: SWAT: 23a6f8e: InukaPageView: update schema version (T238029) (duration: 01m 05s) [19:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:35] T238029: Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 [19:10:36] stephanebisson: here you are [19:10:48] Urbanecm: thanks! [19:10:59] happy to help! [19:13:51] Urbanecm: ok if I go, or anything still going on? [19:14:07] matthiasmullie: I'm still working, i'll ping you once done [19:14:20] perfect! 
[19:14:30] (03CR) 10Dzahn: [C: 03+2] install_server: add etherpad1002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/566630 (owner: 10Dzahn) [19:14:41] (03PS3) 10Dzahn: install_server: add etherpad1002 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/566630 [19:14:42] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:40] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 2d8f773: Use editeditorprotected for protecting pages for editors (T230103) (duration: 01m 05s) [19:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:43] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [19:15:48] matthiasmullie: the floor is yours [19:16:05] thanks [19:17:02] (03PS5) 10Matthias Mullie: Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565987 (https://phabricator.wikimedia.org/T241242) (owner: 10Cparle) [19:17:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10Jclark-ctr) Confirmed: Service Request 1011922914 was successfully submitted. 
[19:17:53] (03CR) 10Matthias Mullie: [C: 03+2] Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565987 (https://phabricator.wikimedia.org/T241242) (owner: 10Cparle) [19:18:50] (03Merged) 10jenkins-bot: Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565987 (https://phabricator.wikimedia.org/T241242) (owner: 10Cparle) [19:19:37] (03PS1) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566841 [19:20:32] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) [19:26:04] (03PS1) 10Ottomata: Add EventStreamConfig extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566842 (https://phabricator.wikimedia.org/T242122) [19:26:24] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 36.96 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:28:14] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:33:17] (03PS4) 10Matthias Mullie: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565614 (https://phabricator.wikimedia.org/T241242) (owner: 10Cparle) [19:33:30] (03CR) 10Matthias Mullie: [C: 03+2] Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565614 (https://phabricator.wikimedia.org/T241242) (owner: 10Cparle) [19:34:07] (03PS1) 10Ottomata: InitialiseSettings.php - wmgUseEventStreamConfig = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566846 (https://phabricator.wikimedia.org/T242122) [19:34:08] (03PS1) 10Ottomata: Enable EventStreamConfig in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566847 (https://phabricator.wikimedia.org/T242122) [19:34:27] (03Merged) 10jenkins-bot: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565614 (https://phabricator.wikimedia.org/T241242) (owner: 10Cparle) [19:35:50] (03PS2) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566841 [19:36:25] (03PS1) 10Ottomata: Load EventStreamConfig if wmgUseEventStreamConfig is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566850 (https://phabricator.wikimedia.org/T242122) [19:37:43] (03CR) 10jerkins-bot: [V: 04-1] Load EventStreamConfig if wmgUseEventStreamConfig is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566850 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [19:40:24] (03CR) 10Jforrester: [C: 04-2] "(Waiting for security review.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566842 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [19:45:00] (03PS1) 
10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566854 [19:50:06] (03CR) 10Matthias Mullie: [C: 03+2] Revert "Remove handler deleted from the MachineVision extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566854 (owner: 10Matthias Mullie) [19:51:07] (03Merged) 10jenkins-bot: Revert "Remove handler deleted from the MachineVision extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566854 (owner: 10Matthias Mullie) [19:51:24] (03PS3) 10Matthias Mullie: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566841 [19:51:31] (03CR) 10Matthias Mullie: [C: 03+2] Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566841 (owner: 10Matthias Mullie) [19:52:23] (03Merged) 10jenkins-bot: Revert "Remove handler deleted from the MachineVision extension on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566841 (owner: 10Matthias Mullie) [19:54:02] (03PS5) 10Matthias Mullie: Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416730 [19:54:11] (03CR) 10Matthias Mullie: [C: 03+2] Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416730 (owner: 10Matthias Mullie) [19:55:12] (03Merged) 10jenkins-bot: Add 3d-patents page to wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416730 (owner: 10Matthias Mullie) [19:56:45] (03PS1) 10Ori.livneh: Update flamegraph.pl to brendangregg/Flamegraph@1a0dc6985a [puppet] - 10https://gerrit.wikimedia.org/r/566858 [19:56:46] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add 3d-patents page to wgForceUIMsgAsContentMsg (duration: 01m 08s) [19:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:56:48] (03PS1) 10Matthias Mullie: Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) [19:57:34] (03PS1) 10Matthias Mullie: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) [20:00:05] brennen and twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200123T2000). [20:00:05] Urbanecm: I'm done [20:00:06] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:13] I suppose there was nothing else? [20:00:19] (end of window anyway, but hey) [20:00:36] Yes,nothing more :) [20:00:42] !log Morning SWAT done [20:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:14] [train noises] [20:03:35] brennen: As in puff-puff-getting-there or screech-slam-derailment? :-) [20:04:19] one moment and we'll find out. :) [20:04:23] * James_F grins. 
[20:05:48] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566863 [20:05:50] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566863 (owner: 10Brennen Bearnes) [20:06:50] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566863 (owner: 10Brennen Bearnes) [20:10:11] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.16 [20:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:48] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:14:01] yow, that's pronounced [20:14:34] CPU is up a lot [20:14:41] (03PS1) 10Jforrester: [nlwiki] Enable VisualEditor by default for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566866 (https://phabricator.wikimedia.org/T161365) [20:14:45] it's during deployment [20:15:04] yeah [20:15:09] it is right after the sync-wikiversions [20:15:15] roll back, you think? 
[20:15:27] let's see if it continues [20:15:30] (03PS1) 10Jforrester: [officewiki] Enable VisualEditor desktop section editing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566867 [20:15:36] i meant to say that it often happens only during the deployment itself [20:15:37] it's not uncommon to have large CPU consumption right after a deploy [20:15:39] and then recovers [20:15:41] that :) [20:16:08] btw, plugging the CPU usage heatmap I added to the 'cluster overview' dashboard https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&fullscreen&panelId=2607 [20:16:26] ooh fancy [20:16:46] if you look at history here, you can see effectively three flavors of appservers [20:16:55] i'm quite sure the small number of nodes that ~always have 0 CPU load are the mwdebugs [20:18:18] how quickly does it recover? [20:20:11] usually faster than icinga rechecks the grafana graph [20:20:16] but this time it hasnt yet [20:20:30] spike of ~20 "MediaWiki::restInPeace: transaction round 'LinksUpdate::doUpdate' still running" errors a bit ago but otherwise error logs look roughly status quo i think, but CPU still elevated... [20:21:46] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:22:42] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 75043 bytes in 0.628 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:26:28] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:47] cdanis, rlazarus: doesn't seem to be tapering off. [20:27:44] yea, this case looks different [20:27:45] * brennen rolling back. 
[20:27:51] thanks brennen [20:27:58] +1 [20:28:00] thanks [20:29:55] !log reverting group2 to 1.35.0-wmf.15 [20:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:50] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.35.0-wmf.15" [20:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:19] well, that's pretty much an immediate effect. [20:32:49] yup that's an "it hurts when I do this" graph [20:33:05] hopefully whatever is going on is just as obvious in profiling :) [20:33:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313 (10Jclark-ctr) 313-hpe smart storage battery 1 Failure - battery shutdown event code: 0x400 action: restart system Needs replacement b... [20:33:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:34:27] filing a ticket on this one, any suggestions for tags? [20:43:56] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313 (10wiki_willy) [20:46:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313 (10wiki_willy) Sure, no problem @Jclark-ctr. 
I've opened up a procurement task via T243547 for @RobH to order a replacement bbu. Thank... [20:47:00] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) >>! In T241359#5827691, @Cmjohnson wrote: > @marostegui I do not know yet, there are several racking/setup tasks that I am trying to get thro... [20:48:39] filed T243548 [20:48:40] T243548: Elevated response times and CPU usage after deploy of 1.35.0-wmf.16 to all wikis - https://phabricator.wikimedia.org/T243548 [21:01:30] (03PS1) 10Ottomata: eventstreams - support Kafka client TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/566870 [21:03:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams - support Kafka client TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/566870 (owner: 10Ottomata) [21:05:30] (03PS1) 10Ottomata: eventstreams - fix typo in kafka hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/566871 [21:06:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams - fix typo in kafka hostname [deployment-charts] - 10https://gerrit.wikimedia.org/r/566871 (owner: 10Ottomata) [21:06:42] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1013 - https://phabricator.wikimedia.org/T242472 (10Jclark-ctr) 05Open→03Resolved Replaced bbu no errrors at this time [21:07:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313 (10Jclark-ctr) Replaced bbu no errrors at this time closing procurement task T243547 not needed at this time [21:08:24] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . 
[21:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:59] (03PS1) 10Volans: raid handler: update Phabricator API call [puppet] - 10https://gerrit.wikimedia.org/r/566872 (https://phabricator.wikimedia.org/T159045) [21:11:44] 10Operations, 10Phabricator, 10SRE-tools, 10Patch-For-Review, 10Technical-Debt: Update Puppet repo code that uses deprecated maniphest.update/.createtask/.query Conduit API - https://phabricator.wikimedia.org/T159045 (10Volans) [21:12:11] (03CR) 10jerkins-bot: [V: 04-1] raid handler: update Phabricator API call [puppet] - 10https://gerrit.wikimedia.org/r/566872 (https://phabricator.wikimedia.org/T159045) (owner: 10Volans) [21:13:11] (03PS2) 10Volans: raid handler: update Phabricator API call [puppet] - 10https://gerrit.wikimedia.org/r/566872 (https://phabricator.wikimedia.org/T159045) [21:14:49] (03CR) 10Volans: "Yes, it should me migrated to Py3, but it will be done in a separate patch ;)" [puppet] - 10https://gerrit.wikimedia.org/r/566872 (https://phabricator.wikimedia.org/T159045) (owner: 10Volans) [21:17:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10wiki_willy) a:03Jclark-ctr [21:17:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10Jclark-ctr) emailed TSR report to dell [21:21:24] (03Abandoned) 10BryanDavis: toollabs: add Buster support to toollabs::apt_pinning [puppet] - 10https://gerrit.wikimedia.org/r/566651 (owner: 10BryanDavis) [21:21:34] (03CR) 10BryanDavis: [C: 03+1] toolforge: refactor elastic role/profile into modern layout [puppet] - 10https://gerrit.wikimedia.org/r/566704 (https://phabricator.wikimedia.org/T236606) (owner: 10Arturo Borrero Gonzalez) [21:24:54] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): 
Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) [21:26:36] (03PS1) 10Bstorm: nfs: puppetize a cloud-vps nfs testbed [puppet] - 10https://gerrit.wikimedia.org/r/566873 (https://phabricator.wikimedia.org/T224582) [21:27:47] Hola Platonides [21:34:56] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:41] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.35.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566875 [21:36:43] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.35.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566875 (owner: 10Brennen Bearnes) [21:37:53] (03Merged) 10jenkins-bot: Revert "all wikis to 1.35.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566875 (owner: 10Brennen Bearnes) [21:42:07] (03PS1) 10Volans: wmf-auto-reimage: update Phabricator API call [puppet] - 10https://gerrit.wikimedia.org/r/566876 (https://phabricator.wikimedia.org/T159045) [21:42:27] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10JHedden) [21:42:29] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) 05Open→03Resolved Drives 2 and 4 had a foreign configuration. I've cleared the configuration and reassigned them as global host spares.... 
[21:43:50] Operations, ops-eqiad, User-jbond, cloud-services-team (Hardware): (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (JHedden) [21:44:02] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:37] (CR) Paladox: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/566876 (https://phabricator.wikimedia.org/T159045) (owner: Volans) [21:50:54] (CR) Alex Monk: "Doesn't sound like my new instance will match:" [puppet] - https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [21:51:52] (PS2) Matthias Mullie: Remove handler deleted from the MachineVision extension [mediawiki-config] - https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) [21:54:01] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T243555 (ops-monitoring-bot) [21:55:55] (CR) Alex Monk: "Spoke to Bryan, it sounds like our existing instances need fixing.
.cloud is a real TLD (I should've noticed that having used it before :|" [puppet] - https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [21:59:53] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T243555 (JHedden) Open→Invalid This is not a failure, the drive is currently rebuilding from task T241884 [22:00:33] (CR) Nuria: "Just one typo" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) (owner: Joal) [22:01:08] (CR) Bstorm: [C: +2] kubernetes: add support for multiple objects of any kind [software/tools-webservice] - https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) (owner: Arturo Borrero Gonzalez) [22:01:22] (CR) Alex Monk: [C: +1] realm: introduce global variable $wmcs_deployment [puppet] - https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [22:01:25] (CR) Bstorm: [C: +2] "Checked with Bryan, and is ok to merge" [software/tools-webservice] - https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) (owner: Arturo Borrero Gonzalez) [22:01:30] (CR) Alex Monk: [C: +1] wmcs: instance: introduce per-deployment openstack profiles [puppet] - https://gerrit.wikimedia.org/r/566736 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [22:01:32] (CR) BryanDavis: [C: +1] "As Krenair noted we need to fix up the instance names generated in codfw1-dev to go with this per https://wikitech.wikimedia.org/wiki/Wiki" [puppet] - https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [22:01:52] (Merged) jenkins-bot: kubernetes: add support for multiple objects of any kind [software/tools-webservice] - https://gerrit.wikimedia.org/r/566045
(https://phabricator.wikimedia.org/T156626) (owner: Arturo Borrero Gonzalez) [22:02:27] (CR) BryanDavis: [C: +1] wmcs: instance: introduce per-deployment openstack profiles [puppet] - https://gerrit.wikimedia.org/r/566736 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [22:05:08] (CR) Bstorm: [C: +2] nfs: puppetize a cloud-vps nfs testbed [puppet] - https://gerrit.wikimedia.org/r/566873 (https://phabricator.wikimedia.org/T224582) (owner: Bstorm) [22:06:35] (CR) Alex Monk: [C: +1] "opened T243556" [puppet] - https://gerrit.wikimedia.org/r/566735 (https://phabricator.wikimedia.org/T242607) (owner: Arturo Borrero Gonzalez) [22:11:04] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313 (Jclark-ctr) Open→Resolved [22:11:07] Operations, ops-eqiad, Patch-For-Review: rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014) - https://phabricator.wikimedia.org/T138509 (Jclark-ctr) [22:17:10] (PS1) Volans: netbox: consolidate naming of resources [puppet] - https://gerrit.wikimedia.org/r/566882 [22:19:24] (CR) Volans: "Compiler result: https://puppet-compiler.wmflabs.org/compiler1003/20549/netbox1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/566882 (owner: Volans) [22:23:17] (CR) CRusnov: "as discussed lgtm" [puppet] - https://gerrit.wikimedia.org/r/566882 (owner: Volans) [22:23:24] (CR) CRusnov: [C: +1] netbox: consolidate naming of resources [puppet] - https://gerrit.wikimedia.org/r/566882 (owner: Volans) [22:27:40] (PS1) Bstorm: nfs: fix the test profile up to work better with an instance [puppet] - https://gerrit.wikimedia.org/r/566886 [22:30:07] (CR) Jforrester: [C: -1] "Waiting for a week."
[mediawiki-config] - https://gerrit.wikimedia.org/r/566866 (https://phabricator.wikimedia.org/T161365) (owner: Jforrester) [22:31:35] (PS1) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/566888 [22:33:56] (PS2) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/566888 [22:36:29] (CR) CDanis: [C: +2] add API key for scraping of LibreNMS's API by Icinga [labs/private] - https://gerrit.wikimedia.org/r/566789 (https://phabricator.wikimedia.org/T224888) (owner: CDanis) [22:36:31] (CR) CDanis: [V: +2 C: +2] add API key for scraping of LibreNMS's API by Icinga [labs/private] - https://gerrit.wikimedia.org/r/566789 (https://phabricator.wikimedia.org/T224888) (owner: CDanis) [22:39:24] (PS3) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/566888 [22:42:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:55] wha [22:46:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:17] (CR) C. Scott Ananian: [C: +1] "LGTM."
[mediawiki-config] - https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra) [23:11:18] (PS2) Dzahn: add etherpad-new.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/566629 (https://phabricator.wikimedia.org/T224580) [23:12:49] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [23:13:08] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) 19 servers in row A racked and Netbox updated mw2291-mw2309 [23:14:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:16:36] Operations, Wikimedia-Etherpad, serviceops: vm request for etherpad1002 - https://phabricator.wikimedia.org/T243475 (Dzahn) Open→Resolved [23:16:40] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (Dzahn) [23:17:12] Operations, Wikimedia-Etherpad, serviceops, Patch-For-Review: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (Dzahn) >>! In T224580#5823156, @akosiaris wrote: >> * prometheus-etherpad-exporter >> * etherpad-lite > > I think both are done now. Wow that was so quick an...
[23:18:18] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) [23:21:00] (PS4) CDanis: Icinga alert for LibreNMS critical alerts [puppet] - https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) [23:21:33] (CR) CDanis: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20555/" [puppet] - https://gerrit.wikimedia.org/r/566888 (https://phabricator.wikimedia.org/T224888) (owner: CDanis) [23:23:02] (CR) Bstorm: [C: +2] nfs: fix the test profile up to work better with an instance [puppet] - https://gerrit.wikimedia.org/r/566886 (owner: Bstorm) [23:23:49] cdanis: ok to merge your labs/private change? [23:24:08] oh oops! yes bstorm_ and sorry [23:24:15] np :) [23:25:14] (CR) Dzahn: [C: +2] add etherpad-new.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/566629 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn) [23:34:36] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) parse200[1-7] racked and Netbox updated [23:41:50] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) switch port information for mw servers in row B rack B3 mw2310-mw2335 : ge-3/0/[26-40] [23:44:44] (PS2) Dzahn: site: add etherpad role to etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566634 (https://phabricator.wikimedia.org/T224580) [23:45:37] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/562363 (owner: Paladox) [23:46:11] (PS2) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/562363 [23:46:13] (PS1) Paladox: Bump Bazel version to 2.0.0 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 [23:46:54]
(CR) jerkins-bot: [V: -1] Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/562363 (owner: Paladox) [23:47:03] (CR) jerkins-bot: [V: -1] Bump Bazel version to 2.0.0 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 (owner: Paladox) [23:53:25] (CR) Dzahn: [C: +2] site: add etherpad role to etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566634 (https://phabricator.wikimedia.org/T224580) (owner: Dzahn) [23:57:18] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/562363 (owner: Paladox) [23:57:22] (CR) Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/566890 (owner: Paladox) [23:59:39] (PS2) Dzahn: trafficserver/cache: add etherpad-new -> etherpad1002 [puppet] - https://gerrit.wikimedia.org/r/566636 (https://phabricator.wikimedia.org/T224580)