[00:19:25] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 9 host(s) and their services with reason: new_install ` mw[2301-2309].codfw.wmnet `
[00:39:19] serviceops, Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (Jdforrester-WMF)
[01:01:20] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 9 host(s) and their services with reason: new_install ` mw[2301-2309].codfw.wmnet `
[01:41:09] serviceops, Operations: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2301 thru mw2309 pooled and set to active in netbox
[01:42:38] serviceops, Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[01:51:39] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) mw2291 through mw2324 are now pooled and status active in netbox (34 servers) mw2325 through mw2334 are not pooled but in site.pp and status staged in net...
[01:51:52] serviceops, Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2291 through mw2324 are now pooled and status active in netbox (34 servers) mw2325 through mw2334 are not pooled but in site.pp and status staged in...
[02:04:04] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[02:17:53] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) p:Triage→High a:Dzahn
[02:22:08] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[02:37:47] serviceops, Operations, Performance-Team, Performance-Team-publish: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Dzahn) set mw1385 - mw1413 all to status Active in Netbox. mw1413 is also pooled meanwhile. mw1403 was planned -> act...
[05:35:59] serviceops, Operations, Performance-Team, Performance-Team-publish: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Joe) Let's see how those numbers work when we decommission the oldest servers, but this seems very encouraging indeed.
[09:20:32] serviceops, Operations, observability, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (fgiunchedi) Open→Resolved a:fgiunchedi Nothing left to do here, resolving
[11:31:00] serviceops, Operations, ops-codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (akosiaris) p:Triage→High
[12:22:55] serviceops, Proton, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Profile proton memory usage for Helm chart - https://phabricator.wikimedia.org/T238830 (akosiaris) Open→Resolved https://releases.wikimedia.org/charts/ \o/ Resolving, I'll track the creation of namespace...
[12:23:35] serviceops, Proton, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Profile proton memory usage for Helm chart - https://phabricator.wikimedia.org/T238830 (akosiaris)
[14:40:53] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Allow MW REST API to be called on job runners and video scalers - https://phabricator.wikimedia.org/T246389 (Joe) There are several issues with this approach: - We need to be able...
[15:23:34] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Allow MW REST API to be called on job runners and video scalers - https://phabricator.wikimedia.org/T246389 (Pchelolo) > We need to be able to discern videoscaling and normal jobr...
[16:40:10] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[2325...
[16:40:40] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: new_install ` mw[2331...
[17:05:11] serviceops, Parsing-critical-path, Parsoid-PHP, MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), Patch-For-Review: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055 (ssastry...
[17:08:04] serviceops, Release-Engineering-Team, Core Platform Team Workboards (Clinic Duty Team): Enable phpdbg on mwdebug* servers - https://phabricator.wikimedia.org/T244549 (hnowlan) Good to see. If there's more that can be done on my end please feel free to re-open, I was a little quick to close this one o...
[17:26:10] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: new_install ` mw[2331...
[17:28:36] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[2325...
[18:09:50] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) ` {"mw2325.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=apache2"} {"mw2325.codfw.wmn...
[18:11:21] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2325 through mw2334 set to Active in Netbox 10 servers pooled at 18:05 UTC, March 6th.
[18:13:48] mw in rack B6 in codfw just became active
[18:14:44] _joe_: just to verify that I understood correctly what you said in the meeting - this https://github.com/wikimedia/operations-mediawiki-config/blob/7a81a13be0c59c23f718952e6c468bdc45aab634/wmf-config/CommonSettings.php#L3819 is not correct?
[18:14:53] the 'videoscaler' part is redundant?
[18:16:21] Pchelolo: i think so.
because of: conftool-data/node/eqiad.yaml: videoscaler: *jobrunner
[18:16:27] nowadays they are not separate machines anymore
[18:16:45] mutante: ok. thank you. I'll make a patch to clear it
[18:17:08] modules/role/manifests/mediawiki/jobrunner.pp: include ::profile::mediawiki::videoscaler
[18:19:39] serviceops, Operations, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[18:21:43] <_joe_> Pchelolo: yes, exactly
[18:21:50] <_joe_> it's redundant
[18:21:55] <_joe_> but I'm off now :P
[18:26:17] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team): Find a way to set elevated timeouts for job running - https://phabricator.wikimedia.org/T247114 (Pchelolo)
[18:26:28] have a good weekend
[18:37:34] serviceops, Release-Engineering-Team, Core Platform Team Workboards (Clinic Duty Team): Enable phpdbg on mwdebug* servers - https://phabricator.wikimedia.org/T244549 (EBernhardson) Further improvements are probably not really an SRE thing. Essentially, use of `../../../home/ebernhardson/test.php`, es...
[19:00:02] serviceops, Release-Engineering-Team-TODO, Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Daimona) Alternatively, can we pull...
[19:20:38] serviceops, Parsing-critical-path, Parsoid-PHP, MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), Patch-For-Review: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055 (cscott)
[19:21:16] serviceops, Parsing-critical-path, Parsoid-PHP, MW-1.35-notes (1.35.0-wmf.20; 2020-02-18), Patch-For-Review: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055 (cscott)
[20:33:50] mutante, you around?
[20:34:31] subbu: he just stepped out to lunch
[20:34:44] looks like puppet runs on scandium are dirtying the /srv/parsoid-testing repo ... and we don't know why.
[20:34:51] rlazarus, ok .. or anyone who knows about puppet :)
[20:35:37] I don't know much but I'll see what I can find out
[20:35:59] subbu: adding files to? or messing with permissions?
[20:36:40] so .. cscott and i have a theory ...
[20:37:08] cscott filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/577654 to fix a perms issue
[20:37:14] but he also manually changed the perms on the directory on scandium.
[20:37:25] so, we think puppet might be trying to reclone it since the perms differ from the state on disk.
[20:37:53] so, maybe we should get 577654 reviewed and merged OR we reset the perms to how puppet thinks it should be?
[20:38:41] for verifying tests locally on scandium, we run a script which changes the git sha that runs .. and arlo found that in the middle of the runs, the git repo would get reset / dirtied.
[20:39:00] and cscott and i think it is the permissions issue but we don't know more than that.
[20:41:51] which permissions are you getting and which do you expect?
[20:42:51] the `shared` parameter sets git's core.sharedRepository config to "group", and it also changes the default permissions from 0755 to 2755, but if you want to change the permissions more generally there's the `mode` parameter for that
[20:43:20] it is g-w ... and we want it g+w and all new files to also get the g+w perm set.
[20:43:35] so, whatever makes that work is what we need :)
[20:44:30] I _think_ you want to set both shared and mode, but let me keep reading docs real quick
[20:44:55] or if cdanis knows, I'd trust him :D
[20:46:26] I don't see any existing instances of git::clone where shared and mode are both set, but I also don't see any reason why it wouldn't work
[20:47:16] setting mode might be sufficient, I'm not sure
[20:49:23] no, disregard that last -- sorry for the back and forth :) setting *shared* might be sufficient, and we can definitely give 577654 a try -- we might want to follow up with another patch to set mode if that doesn't do the trick
[20:56:09] subbu: anything before I merge this, or should I do that right now?
[22:23:49] subbu: ping on that question ^ :) happy to merge that if you're ready, just want to confirm first so I don't step on anything
[22:24:20] oh yes, I thought you didn't want to merge it yourself ... so, miscommunication. :-)
[22:24:28] please +2 it whenever. :)
[22:24:42] cool, going ahead
[22:27:47] subbu: merged, and ran puppet on scandium -- let me know how that looks
[22:29:13] lgtm .. and puppet didn't auto-refresh the repo which is how it should be ... so presumably our manual change of perms is what caused puppet to dirty the repo earlier.
[22:29:22] will let you know if we see any problems.
[22:31:03] sounds good
[22:41:59] subbu: shutting off for the day, but I'll be nearby for a few more hours -- feel free to ping me if anything goes awry with that change, and otherwise have a good weekend
[22:42:41] sounds good. it is just a test server, so even if something breaks, it can wait till next week.
[22:42:48] 👍
[22:57:20] ah, thanks for dealing with the scandium issue
[23:06:52] serviceops, MediaWiki-JobQueue, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Enable MW REST API on job runners and video scalers (for the new rest.php job executor) - https://phabricator.wikimedia.org/T246389 (Krinkle)
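
Editor's note on the permissions exchange at 20:42-20:43: a minimal Python sketch of what the octal modes quoted there mean. The values 0755 and 2755 come from the log; 2775 is an assumption here, being simply 2755 with the group-write bit (the "g+w" that was asked for) added. The helper name `describe` is hypothetical and nothing below touches the puppet `git::clone` resource or the repo on scandium; it only decodes mode bits with the standard `stat` module.

```python
import stat


def describe(mode: int) -> str:
    """Render a directory mode the way `ls -l` would show it."""
    return stat.filemode(stat.S_IFDIR | mode)


DEFAULT_MODE = 0o755   # default quoted in the log
SHARED_MODE = 0o2755   # what `shared` was said to change the default to
# Assumption: the requested "g+w" is the shared mode plus group write.
WANTED_MODE = SHARED_MODE | stat.S_IWGRP

for label, mode in [
    ("default", DEFAULT_MODE),
    ("shared", SHARED_MODE),
    ("wanted", WANTED_MODE),
]:
    group_writable = bool(mode & stat.S_IWGRP)
    setgid = bool(mode & stat.S_ISGID)
    print(f"{label:8} {mode:04o} {describe(mode)} "
          f"g+w={group_writable} setgid={setgid}")
```

The leading 2 is the setgid bit, which on a directory makes newly created entries inherit the directory's group. It does not by itself add group write, which is consistent with the log's reading that 2755 is still g-w and that a `mode` follow-up might be needed if `shared` alone did not do the trick.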