[07:40:04] I want to test-drive the 7.2.21 packages on mwdebug (independent of whether we also add the gc patch on top). Is the PHP7 flag from the X-Wikimedia-Debug browser extension still working? (IIRC there were some changes around the PHP 7 beta option being dropped)
[08:13:41] <_joe_> yes
[08:13:55] <_joe_> it still works and you can pick the right interpreter from it
[08:18:33] thx, gonna do some tests on them in a bit
[08:50:17] Hi, since we deployed termbox yesterday we've noticed that on a semi-regular basis its requests to the mediawiki api app servers seem to time out. I opened a ticket at https://phabricator.wikimedia.org/T230976. It looks to me like something happens semi-regularly, at around 22-28 min intervals. Is there anything like that on the k8s setup that springs to mind? Some kind of GC or something that happens around then?
[08:50:31] Or maybe even something that regularly happens on the app servers?
[09:00:15] <_joe_> nope
[09:00:35] <_joe_> and I won't have time to look into it right now
[09:33:24] _joe_: cool! understood :). No hurry; it's *probably* not even noticeable to the users
[10:11:36] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?
[10:12:00] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) p: Triage→Normal a: Joe
[10:12:10] serviceops, Operations, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Joe) p: Triage→Normal
[11:19:48] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[11:27:31] serviceops, Operations, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (jijiki) We are on our way to finishing the migration to PHP7; my opinion is to try the PHP 7.2 bandaid rather than upgrading production to P...
[13:20:56] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) I'm looking into what it would take to monitor a celery worker pool on a specific machine....
[13:45:17] serviceops, Operations, Patch-For-Review, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[13:47:51] serviceops, Scap, PHP 7.2 support, Patch-For-Review, and 3 others: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (jijiki) @thcipriani When a deployer runs scap pull on mwdebug, they get `sudo: [/usr/local/sbin/check-and-restart-php,: command not fo...
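(Re the mwdebug/PHP7 testing discussed at 07:40 above, a minimal sketch of pinning a request to a debug backend: the backend= header attribute follows the documented X-Wikimedia-Debug syntax, while the PHP_ENGINE cookie for picking the PHP7 interpreter is an assumption based on how the beta opt-in worked.)

    # Pin a request to a specific mwdebug host (documented X-Wikimedia-Debug
    # syntax; mwdebug1002 is an arbitrary pick)
    curl -sv -o /dev/null \
      -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
      https://en.wikipedia.org/wiki/Special:BlankPage
    # Select the PHP7 interpreter on that host; the browser extension's flag is
    # assumed here to set the same opt-in cookie the PHP 7 beta used
    curl -sv -o /dev/null \
      -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
      -b 'PHP_ENGINE=php7' \
      https://en.wikipedia.org/wiki/Special:BlankPage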
[14:08:45] <_joe_> jijiki, apergos shall we skip the discussion meeting this week? I could use a break :P
[14:08:56] we're missing mark of course
[14:09:17] who else is not here? is akosiaris also out?
[14:09:23] <_joe_> yes
[14:09:24] <_joe_> and daniel
[14:09:31] <_joe_> it's the three of us
[14:09:42] seems fine by me, you have no burning topics I guess?
[14:09:50] <_joe_> yeah
[14:10:09] <_joe_> unless you want a 30 min whining session on all-hands dates :P
[14:10:17] I've been keeping up on the comments
[14:10:38] as someone who's never been involved with FOSDEM, my voice would have no weight here, but I do hope it gets sorted
[14:11:04] <_joe_> I think it's more about being heard than the actual conference :)
[14:12:05] I think it's both: being heard doesn't necessarily mean the problem gets resolved!
[14:12:30] <_joe_> indeed :D
[14:17:40] serviceops, Operations, Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Jdlrobson)
[14:18:16] <_joe_> tarrow: what was your task about termbox timeouts?
[14:20:16] serviceops, Operations, PHP 7.2 support: Mysterious, coordinated slowdowns every ~25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (Joe)
[14:20:36] <_joe_> found it, nevermind
[14:20:51] serviceops, Operations, PHP 7.2 support: Mysterious, coordinated slowdowns every ~25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (Joe) p: Triage→High
[14:23:59] serviceops, Operations, PHP 7.2 support: Mysterious, coordinated slowdowns every ~25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (jijiki)
[14:36:21] serviceops, Operations, Traffic, Patch-For-Review: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (MoritzMuehlenhoff) a: MoritzMuehlenhoff New VMs (failoid1001 and failoid2001) have been set up and are in active use now. I'll keep the old jessie VMs around f...
[14:50:31] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) So, I've been trying to explore the behaviors of `celery -A ores_celery inspect ping` to see if...
[15:24:44] <_joe_> urandom: my idea is basically to go to kubestage and run perf on the process to register the syscalls
[15:25:18] _joe_: +1
[15:25:42] <_joe_> I'm a bit rusty on how to calculate the duration of each
[15:26:07] <_joe_> but the premise should be that we don't need to take the horrendous perf hit that strace gives you
[15:27:18] _joe_: I dunno if that hit would be an issue or not; the staging pod is sufficiently small that it's already quite constrained
[15:28:19] <_joe_> oh strace basically serializes your syscalls, and ptrace is kinda dangerous
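(A minimal sketch of the perf-based approach _joe_ outlines above, assuming perf trace's summary mode; <PID> is a placeholder and the trailing sleep bounds the run, as with perf record.)

    # Per-syscall counts plus min/avg/max latency -- answering "the duration of
    # each" -- without strace's overhead: strace ptrace-stops the target at every
    # syscall entry/exit, effectively serializing it, while perf uses tracepoints
    sudo perf trace -s -p <PID> -- sleep 30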
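(And a sketch of a per-host worker check along the lines Halfak describes at 14:50, assuming the default celery@<hostname> node naming and that `inspect ping` exits non-zero when no node replies.)

    # Ping only the worker pool on this machine; -d limits the ping to one node
    celery -A ores_celery inspect ping -d "celery@$(hostname)" --timeout 5 \
      || echo "CRITICAL: celery worker pool on $(hostname) not responding"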
[17:42:45] just noticed something in db-eqiad.php
[17:42:48] # This key must exist for the master switch script to work, which means comment and uncomment
[17:42:50] # the individual shards, but leave the 'readOnlyBySection' => [ ], alone.
[17:42:54] where does this script live?
it surely needs updates post-dbctl
[17:43:03] mmh I will file a task
[17:44:35] cdanis: that was the switchdc repo that became the basis for writing spicerack; it now lives as cookbooks in the cookbooks repo
[17:44:44] but
[17:44:58] it doesn't use readOnlyBySection anymore IIRC, and uses the global RO instead
[17:46:03] serviceops, Operations, conftool: update master DC switch script for a post-dbctl world - https://phabricator.wikimedia.org/T231035 (CDanis)
[17:46:10] ahah
[17:46:20] >While I'm mildly surprised that the DC switch script cares about per-section readonly status -- I had assumed it would use the other mwconfig conftool objects WMFMasterDatacenter and the per-DC ReadOnly -- if the switch script does need to edit readOnlyBySection it needs to be updated to invoke dbctl.
[17:46:29] beaten by volans again
[17:46:41] lol, it's not a race :D
[17:46:48] side rant, we never update our documentation here
[17:47:01] true that, and guilty as charged
[17:47:49] serviceops, Operations, conftool: update master DC switch script for a post-dbctl world - https://phabricator.wikimedia.org/T231035 (CDanis) Open→Invalid `17:44:35 cdanis: that was the switchdc repo that became the basis for writing spicerack; it now lives as cookbooks in the cookbooks repo...
[17:50:08] also haha
[17:50:10] {"ReadOnly": {"val": "You can't edit now. This is because of maintenance. Copy and save your text and try again in a few minutes."}, "tags": "scope=codfw"}
[17:50:36] I guess technically the migration to active-active is a form of maintenance 🤔
[17:54:07] the reason is a CLI param for the script
[17:54:30] so it could be overridden :D
[17:55:06] I have to confess I don't understand how the two different flavors of readonly messages interact
[17:55:10] (per-section vs per-DC)
[17:56:53] per-DC wins
[17:57:13] per-section is used when we do a master failover on a specific section
[17:58:24] for the first switchdc we did, people were not sure about the global RO, so we modified all the per-section ones :)
[17:58:34] because that was "tested" and we knew it would work
[17:58:39] haha
[17:58:47] are we going to do another one of those btw
[17:59:04] (do we ever do any other disaster preparedness exercises?)
[17:59:08] in theory yes, unless the a/a work catches up before that :D
[17:59:48] that's a much longer discussion ;) switchdc and disaster recovery are quite different in scope and execution
[18:32:02] <_joe_> cdanis: yeah that comment is clearly outdated
[18:32:14] <_joe_> we use the global RO from etcd during switchovers
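(For reference, a sketch of the two read-only flavors discussed above; the exact dbctl/confctl argument spellings are assumptions to check against the tools' docs, not authoritative.)

    # Per-section read-only, the modern replacement for editing readOnlyBySection
    # in db-eqiad.php by hand; used during a master failover on one section
    sudo dbctl --scope eqiad section s1 ro "Maintenance: s1 master switchover"
    sudo dbctl config commit -m "Set s1 read-only for master switchover"
    # ...and back to read-write afterwards
    sudo dbctl --scope eqiad section s1 rw
    sudo dbctl config commit -m "Set s1 back to read-write"
    # Per-DC read-only (which wins over per-section) lives in the mwconfig
    # conftool objects that the switchover cookbooks flip, e.g.:
    confctl --object-type mwconfig select 'name=ReadOnly,scope=codfw' get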