[06:54:20] serviceops: mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (elukey) @RLazarus @CDanis maybe as an interim solution, while we think about a more "final" solution, we could change `--probe-timeout-initial` from 3s to something like 30/60/300s? To avoid constant flaps if ano...
[10:51:02] all deployers: the scap sync --canary-wait-time option is available (https://phabricator.wikimedia.org/T217924)
[11:10:45] hi all, could someone let me know how i go about merging and deploying a mediawiki-config patch? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606407
[11:11:10] not something i have had to do before
[11:14:47] liw: mori.tzm mentioned you may be able to help with ^^
[11:20:53] jbond42, I fear I am utterly incompetent with regards to anything related to MediaWiki configuration, sorry
[11:21:36] liw: ack, thanks; will wait for someone else. guess everyone is on lunch
[11:38:52] mutante: jayme: akosiaris: perhaps ^^
[11:41:26] I've not touched that area till now - so I've no real idea either. :/
[11:42:20] ack, thanks jayme
[11:45:10] * akosiaris around now, looking
[11:45:25] akosiaris: thanks
[11:46:25] jbond42: scap sync-file and you should be good
[11:46:40] +1ed
[11:47:26] akosiaris: i have never deployed a mediawiki-config change at all. is it just submit, then `scap sync-file $thefile` from deploy1001?
[11:48:38] and a git pull before that
[11:49:02] ack, thanks, i'll give it a try
[11:49:26] dir is /srv/mediawiki-staging btw
[11:49:35] yep, ack
[11:51:41] labs should pick it up automatically in 10 or so mins IIRC
[11:52:07] seems to have gone smoothly, ack
[13:10:23] serviceops, CX-cxserver, Language-Team (Language-2020-Focus-Sprint), Release-Engineering-Team (Pipeline): Migrate apertium to the deployment pipeline - https://phabricator.wikimedia.org/T255672 (KartikMistry)
[14:15:09] serviceops, MediaWiki-General, Operations, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (AMooney) @tstarling anything left to do for this task?
[15:36:26] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris)
[17:59:19] I want to put some MediaWiki appserver saturation metrics (busy vs idle worker threads, overall CPU usage) on a dashboard somewhere, but I'm not quite sure where that should be
[17:59:49] what are you trying to get out of the numbers?
[18:00:40] mostly I want to make it more obvious when we're in a situation where all the workers are consumed/stuck, and link to it from an alert about such
[18:00:59] there's the RED dashboard, but RED explicitly doesn't cover saturation 🙃
[18:01:39] (4GS is basically RED plus saturation, though)
[18:15:26] ημμ
[18:15:29] woops
[18:15:30] hmmm
[18:16:15] * apergos looks around at the other existing apps-related dashboards
[18:16:30] would this be the regular appservers, or also api and whatever else?
[18:16:52] mostly appserver and api, maybe parsoid as well
[18:17:13] i'm more concerned with things that are in the user query path, and i suspect (but haven't checked) that high utilization of jobrunner threads is 'normal'
[18:17:28] serviceops, Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2339.codfw.wmnet ` The log can be found in `/var/log/...
[18:18:03] I hate to add to dashboard proliferation, but maybe a new one is appropriate
[18:20:15] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1 otoh there is already a 'worker saturation' panel there
[18:20:21] unless you just added it, that is :-P
[18:20:23] yeah, that's for a single worker
[18:20:28] this is for the whole fleet
[18:20:56] AIUI there isn't really a whole-fleet MW dashboard aside from the RED dashboard
[18:21:27] not so much
[18:21:29] guess it's time
[18:22:39] idk, I'm tempted to add a (maybe default-collapsed) 'Saturation' section to the RED dashboard, it's just one panel
[18:46:18] serviceops, MediaWiki-General, Security-Team, Performance-Team (Radar), Security: Create a tmp directory just for MediaWiki - https://phabricator.wikimedia.org/T179901 (BPirkle) a: BPirkle→None
[18:51:30] serviceops, Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2339.codfw.wmnet'] ` and they were **ALL** successful.
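The "worker saturation" metric discussed above is just busy / (busy + idle) worker threads, aggregated per cluster instead of per host. A minimal sketch of that aggregation, with made-up sample data (the cluster names and counts are illustrative; in production this would be a Prometheus/Grafana query, not a script):

```python
from collections import defaultdict

# Hypothetical per-host samples: (cluster, busy_workers, idle_workers).
samples = [
    ("appserver", 40, 24),
    ("appserver", 60, 4),
    ("api_appserver", 10, 54),
]

def cluster_saturation(samples):
    """Return busy / (busy + idle) per cluster: the fleet-wide saturation ratio."""
    busy = defaultdict(int)
    total = defaultdict(int)
    for cluster, b, i in samples:
        busy[cluster] += b
        total[cluster] += b + i
    return {c: busy[c] / total[c] for c in total}

print(cluster_saturation(samples))
```

A ratio near 1.0 for a cluster is exactly the "all workers consumed/stuck" condition the alert would fire on; a per-cluster dashboard variable would then select which of these series a single panel shows.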
[18:53:19] well, if you want to separate out api and regular and jobrunner also
[18:53:23] then that's 4 panels
[18:53:32] or i guess you could do them all on one, meh
[18:53:42] no, I'd just make it show whichever cluster is selected as a variable at the top
[18:53:54] that works too
[18:55:01] I do think it is a high-level system health indicator for appservers, and AIUI the RED dashboard is intended to display such things, even if this is stretching the definition of RED a bit
[18:56:27] I'd rather have the right graphs in the right place than be religious about the definitions
[18:56:46] but the "RED" name predates me, so grain of salt
[18:57:55] for context, there's a nice short breakdown of USE/RED/4GS here: https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/ (with a longer talk, which I haven't watched, available there too)
[22:05:55] serviceops, Operations, observability, Patch-For-Review: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (CDanis) Open→Resolved We now have an alert and a graph based on scraping the status string that php-fpm provides to systemd, which is reliab...
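The T252605 resolution above mentions scraping the status string that php-fpm reports to systemd. A rough sketch of that parsing step in Python (the exact status-line format shown is an assumption modeled on php-fpm's sd_notify status output, and `parse_workers` is a hypothetical helper, not the deployed code):

```python
import re

# Example status string as php-fpm reports it to systemd; the format is an
# assumption - compare `systemctl show -p StatusText php7.x-fpm` on a real host.
STATUS = "Processes active: 3, idle: 13, Requests: 1234, slow: 0, Traffic: 0.2req/sec"

PATTERN = re.compile(r"Processes active: (\d+), idle: (\d+)")

def parse_workers(status):
    """Extract (busy, idle) worker counts from a php-fpm systemd status string."""
    m = PATTERN.search(status)
    if m is None:
        raise ValueError("unrecognized php-fpm status: %r" % status)
    return int(m.group(1)), int(m.group(2))

busy, idle = parse_workers(STATUS)
print(busy, idle, busy / (busy + idle))  # counts plus the saturation ratio
```

Scraping this string sidesteps the reliability problems of hitting php-fpm's own status endpoint from a worker that may itself be saturated, which is presumably why the task settled on it.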